tika package
Submodules
tika.config module
- async tika.config.get_parsers()[source]#
Retrieves the list of available parsers from the Tika server.
Fetches detailed information about all parsers supported by the Tika server, including their supported MIME types and parser properties.
- Return type:
- Returns:
A response containing the parser configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.
- Raises:
TikaError – If the server request fails or returns an error status.
Example
>>> parsers = await get_parsers()
>>> print(parsers)  # Prints JSON of available parsers and their capabilities
- async tika.config.get_mime_types()[source]#
Retrieves the list of supported MIME types from the Tika server.
Fetches the complete list of MIME types that the Tika server can handle, including file extensions and type hierarchies.
- Return type:
- Returns:
A response containing the MIME type configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.
- Raises:
TikaError – If the server request fails or returns an error status.
Example
>>> mime_types = await get_mime_types()
>>> print(mime_types)  # Prints JSON of supported MIME types
- async tika.config.get_detectors()[source]#
Retrieves the list of available content detectors from the Tika server.
Fetches information about all content type detectors supported by the Tika server, including their detection capabilities and priorities.
- Return type:
- Returns:
A response containing the detector configuration, typically as JSON. The return type matches the server response format, which may be str, bytes, or BinaryIO.
- Raises:
TikaError – If the server request fails or returns an error status.
Example
>>> detectors = await get_detectors()
>>> print(detectors)  # Prints JSON of available detectors
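The three functions above are coroutines, and each documents its result as str, bytes, or BinaryIO. The following is a minimal sketch of normalizing such a result into parsed JSON. Both `to_json` and `fake_get_parsers` are hypothetical names introduced for illustration; the stand-in returns canned data so the snippet runs without a Tika server.

```python
import asyncio
import io
import json
from typing import Any, BinaryIO, Union


def to_json(response: Union[str, bytes, BinaryIO]) -> Any:
    """Decode a str, bytes, or binary file-like response into Python data."""
    if hasattr(response, "read"):  # file-like object: read its bytes first
        response = response.read()
    if isinstance(response, bytes):
        response = response.decode("utf-8")
    return json.loads(response)


async def fake_get_parsers() -> bytes:
    """Hypothetical stand-in for `await tika.config.get_parsers()` (canned data)."""
    return b'{"name": "example-parser", "supportedTypes": ["text/plain"]}'


async def main() -> Any:
    raw = await fake_get_parsers()  # the real call would be awaited the same way
    return to_json(raw)


parsers = asyncio.run(main())
print(parsers["supportedTypes"])
print(to_json(io.BytesIO(b"[1, 2]")))  # file-like input is handled too
```

Normalizing once at the call site keeps the rest of the code independent of which of the three response types the server call produced.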
tika.core module
The Tika Python module provides a Python API client for the Apache Tika Server.
Example usage (the API functions are coroutines, so await them inside an async function):
import tika
from tika import parser
parsed = await parser.from_file('/path/to/file')
print(parsed["metadata"])
print(parsed["content"])
Visit https://github.com/chrismattmann/tika-python to learn more about it.
Detect IANA MIME Type:
from tika import detector
print(await detector.from_file('/path/to/file'))
Detect Language:
from tika import language
print(await language.from_file('/path/to/file'))
*Tika-Python Configuration* You can use custom configuration files. See https://tika.apache.org/1.18/configuring.html for details on writing configuration files. Configuration is set the first time the server is started. To use a configuration file with a parser or detector:
parsed = await parser.from_file('/path/to/file', config_path='/path/to/configfile')
- or:
detected = await detector.from_file('/path/to/file', config_path='/path/to/configfile')
- or:
detected = await detector.from_buffer('some buffered content', config_path='/path/to/configfile')
- async tika.core.run_command(cmd, option, url_or_paths, port, out_dir=None, server_host='localhost', tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), verbose=0, encode=0)[source]#
Execute a Tika command by calling the Tika server and return results.
- Parameters:
cmd (str) – The command to execute. Must be one of: ‘parse’, ‘detect’, ‘language’, ‘translate’, or ‘config’.
option (str) – Command-specific option that modifies the behavior (e.g., ‘all’ for the parse command).
url_or_paths (Iterable[str | Path | BinaryIO]) – One or more files to process, specified as URLs, file paths, or file-like objects.
port (str) – The port number where the Tika server is running.
out_dir (Path | None) – Optional directory path where output files should be saved. Defaults to None.
server_host (str) – The hostname where the Tika server is running. Defaults to SERVER_HOST.
tika_server_jar (Path) – Path to the Tika server JAR file. Defaults to TIKA_SERVER_JAR.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
encode (int) – Whether to encode the response in UTF-8. Defaults to ENCODE_UTF8.
- Returns:
For ‘parse’ with out_dir: List of Path objects for created metadata files
For ‘parse’ without out_dir: List of tuples containing (status_code, response)
For other commands: String, bytes or file-like object containing the response
- Return type:
Depending on the command
- Raises:
TikaError – If no URLs/paths are specified for parse/detect commands or if command is unknown.
- tika.core.get_paths(url_or_paths)[source]#
Convert URLs, file paths, or file-like objects into a list of Path objects.
Handles single paths, lists of paths, and directories. For directories, recursively finds all files within them.
- Parameters:
url_or_paths (Iterable[str | Path | BinaryIO]) – Input paths as URLs, file paths, directories, or file-like objects. Can be a single item or an iterable of items.
- Returns:
List of Path objects for all files found.
- Return type:
list[Path]
Note
When a directory is provided, all files within it (including in subdirectories) are included in the returned list.
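The directory handling described in the note can be sketched with pathlib. This is an illustration of the documented behavior, not the library's implementation; `expand_paths` is a hypothetical helper, and it covers only local paths (URLs and file-like objects are left out).

```python
import tempfile
from pathlib import Path
from typing import Iterable, List, Union


def expand_paths(items: Iterable[Union[str, Path]]) -> List[Path]:
    """Return every file named directly or found recursively under a directory."""
    found: List[Path] = []
    for item in items:
        path = Path(item)
        if path.is_dir():
            # rglob("*") walks subdirectories too; keep regular files only
            found.extend(sorted(p for p in path.rglob("*") if p.is_file()))
        else:
            found.append(path)
    return found


# Build a small tree: root/a.txt and root/sub/b.txt
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sub").mkdir()
    (root / "a.txt").write_text("a")
    (root / "sub" / "b.txt").write_text("b")
    names = {p.name for p in expand_paths([root])}

print(sorted(names))
```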
- async tika.core.parse_and_save(option, url_or_paths, *, out_dir=None, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', meta_extension='_meta.json', services=None)[source]#
Parse files and save extracted metadata/text as JSON files.
For each input file, creates a corresponding metadata file with the specified extension. The metadata files contain the extracted information in JSON format.
- Parameters:
option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).
url_or_paths (Iterable[str | Path | BinaryIO]) – Files to parse as URLs, paths, or file-like objects.
out_dir (Path | None) – Directory where metadata files should be saved. If None, saves alongside input files. Defaults to None.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “application/json”.
meta_extension (str) – Extension to append to metadata filenames. Defaults to “_meta.json”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.
- Returns:
List of paths to the created metadata files.
- Return type:
list[Path]
Note
For each input file ‘example.pdf’, creates ‘example.pdf_meta.json’ (or similar based on meta_extension) containing the extracted information.
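The naming rule in the note can be written out as a small helper. This is a sketch inferred from the documented behavior (extension appended to the full file name, output placed in out_dir or next to the input); `meta_path_for` is a hypothetical name, and the library's exact path handling may differ.

```python
from pathlib import Path
from typing import Optional


def meta_path_for(
    src: Path,
    out_dir: Optional[Path] = None,
    meta_extension: str = "_meta.json",
) -> Path:
    """Compute where the metadata file for `src` would be written."""
    # Assumption: with no out_dir, the file is saved alongside the input
    target_dir = out_dir if out_dir is not None else src.parent
    # The extension is appended to the whole name, so 'example.pdf' keeps '.pdf'
    return target_dir / (src.name + meta_extension)


print(meta_path_for(Path("docs/example.pdf")))
print(meta_path_for(Path("docs/example.pdf"), out_dir=Path("out")))
```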
- async tika.core.parse(option, url_or_paths, *, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, raw_response=False)[source]#
Parse files and extract metadata and/or text content using Tika.
- Parameters:
option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).
url_or_paths (Iterable[str | Path | BinaryIO]) – Files to parse as URLs, paths, or file-like objects.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “application/json”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.
raw_response (bool) – If True, return raw response content. Defaults to False.
- Returns:
List of tuples containing (HTTP status code, parsed content) for each processed file.
- Return type:
- async tika.core.parse_1(option, url_or_path, *, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, raw_response=False, headers=None, config_path=None, request_options=None)[source]#
Parse a single file and extract metadata and/or text content using Tika.
- Parameters:
option (str) – Parsing option (‘meta’, ‘text’, or ‘all’).
url_or_path (str | Path | BinaryIO) – File to parse as URL, path, or file-like object.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “application/json”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘meta’: ‘/meta’, ‘text’: ‘/tika’, ‘all’: ‘/rmeta’}.
raw_response (bool) – If True, return raw response content. Defaults to False.
headers (dict[str, Any] | None) – Additional HTTP headers for the request. Defaults to None.
config_path (str | None) – Path to Tika config file. Defaults to None.
request_options (dict[str, Any] | None) – Additional request options. Defaults to None.
- Returns:
Tuple containing HTTP status code and parsed content.
- Return type:
- async tika.core.detect_lang(option, url_or_paths, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None)[source]#
Detect the language of files using Tika.
- Parameters:
option (str) – Detection option (usually ‘file’).
url_or_paths (Iterable[str | Path | BinaryIO]) – Files to analyze as URLs, paths, or file-like objects.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “text/plain”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘file’: ‘/language/stream’}.
- Returns:
List of tuples containing (HTTP status code, detected language code) for each file.
- Return type:
Note
Language codes are returned as ISO 639-1 two-letter codes (e.g., ‘en’ for English).
- async tika.core.detect_lang_1(option, url_or_path, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None, request_options=None)[source]#
Detect the language of a single file using Tika.
- Parameters:
option (str) – Detection option (usually ‘file’).
url_or_path (str | Path | BinaryIO) – File to analyze as URL, path, or file-like object.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “text/plain”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘file’: ‘/language/stream’}.
request_options (dict[str, Any] | None) – Additional request options. Defaults to None.
- Returns:
Tuple containing HTTP status code and detected language code.
- Return type:
- Raises:
TikaError – If the specified option is not valid.
- async tika.core.detect_type(option, url_or_paths, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None)[source]#
Detect MIME types of files using Tika.
- Parameters:
option (str) – Detection option (usually ‘type’).
url_or_paths (Iterable[str | Path | BinaryIO]) – Files to analyze as URLs, paths, or file-like objects.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “text/plain”.
services (dict[str, str] | None) – Dict mapping options to service endpoints. Defaults to {‘type’: ‘/detect/stream’}.
- Returns:
List of tuples containing (HTTP status code, detected MIME type) for each file.
- Return type:
- async tika.core.detect_type_1(option, url_or_path, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='text/plain', services=None, config_path=None, request_options=None)[source]#
Detect MIME type of a single file using Tika.
- Parameters:
option (str) – Detection option (usually ‘type’).
url_or_path (str | Path | BinaryIO) – File to analyze as URL, path, or file-like object.
server_endpoint (str) – Tika server URL. Defaults to SERVER_ENDPOINT.
verbose (int) – Logging verbosity level. Defaults to VERBOSE.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected response format. Defaults to “text/plain”.
services (dict[str, Any] | None) – Dict mapping options to service endpoints. Defaults to {‘type’: ‘/detect/stream’}.
config_path (str | None) – Path to Tika config file. Defaults to None.
request_options (dict[str, Any] | None) – Additional request options. Defaults to None.
- Returns:
Tuple containing HTTP status code and detected MIME type.
- Return type:
- Raises:
TikaError – If the specified option is not valid.
- async tika.core.get_config(option, server_endpoint='http://localhost:9998', verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), response_mime_type='application/json', services=None, request_options=None)[source]#
Retrieves configuration information from the Tika server.
Makes a GET request to the Tika server to fetch configuration details about various server capabilities including parsers, detectors, and MIME types.
- Parameters:
option (str) – The configuration to retrieve. Must be one of: “mime-types” (list of supported MIME types), “detectors” (available content type detectors), or “parsers” (available document parsers).
server_endpoint (str) – URL of the Tika server. Defaults to SERVER_ENDPOINT.
verbose (int) – Level of logging verbosity. Defaults to VERBOSE.
tika_server_jar (Path) – Path to the Tika server JAR file. Defaults to TIKA_SERVER_JAR.
response_mime_type (str) – Expected MIME type of the response. Defaults to “application/json”.
services (dict[str, str] | None) – Optional dictionary mapping config options to their service endpoints. Defaults to {“mime-types”: “/mime-types”, “detectors”: “/detectors”, “parsers”: “/parsers/details”}.
request_options (dict[str, Any] | None) – Optional dictionary of additional request options.
- Returns:
- Tuple containing:
HTTP status code (int)
Server response (str, bytes, or BinaryIO) containing the requested configuration
- Return type:
tuple[int, str | bytes | BinaryIO]
- Raises:
TikaError – If the server returns an error status
ValueError – If an invalid option is specified
RuntimeError – If the server cannot be contacted
Example
>>> status, config = await get_config("parsers")
>>> if status == 200:
...     print(config)  # Print parser configuration
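How an option maps to a request URL through the services dictionary can be sketched as follows. The default mapping is taken from the parameter description above; `config_url` is a hypothetical helper, and joining the endpoint and service path by simple concatenation is an assumption about the implementation.

```python
from typing import Dict, Optional

# Default option-to-endpoint mapping, as listed in the get_config parameters
DEFAULT_SERVICES: Dict[str, str] = {
    "mime-types": "/mime-types",
    "detectors": "/detectors",
    "parsers": "/parsers/details",
}


def config_url(
    option: str,
    server_endpoint: str = "http://localhost:9998",
    services: Optional[Dict[str, str]] = None,
) -> str:
    """Resolve a config option to the full URL that would be requested."""
    mapping = services if services is not None else DEFAULT_SERVICES
    if option not in mapping:
        # Mirrors the documented ValueError for an invalid option
        raise ValueError(f"invalid option {option!r}; expected one of {sorted(mapping)}")
    return server_endpoint.rstrip("/") + mapping[option]


print(config_url("parsers"))
```

Passing a custom services dictionary replaces the defaults entirely in this sketch, so an option missing from the custom mapping raises ValueError.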
- tika.core.get_async_client()[source]#
Returns a memoized httpx.AsyncClient instance. Creates a new one if it doesn’t exist.
- Return type:
AsyncClient
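The memoization pattern described here can be sketched without any third-party dependency. `DummyClient` is a hypothetical stand-in for httpx.AsyncClient (which exposes an `aclose()` coroutine), and `get_client` / `close_client` are illustrative names, not the library's functions.

```python
import asyncio
from typing import Optional


class DummyClient:
    """Hypothetical stand-in for httpx.AsyncClient so the sketch runs anywhere."""

    def __init__(self) -> None:
        self.closed = False

    async def aclose(self) -> None:
        self.closed = True


_client: Optional[DummyClient] = None  # module-level cache


def get_client() -> DummyClient:
    """Return the cached client, creating it on first use."""
    global _client
    if _client is None:
        _client = DummyClient()
    return _client


async def close_client() -> None:
    """Close and forget the cached client, if one exists."""
    global _client
    if _client is not None:
        await _client.aclose()
        _client = None


first = get_client()
same = get_client()          # second call reuses the cached instance
asyncio.run(close_client())  # closes the client and clears the cache
fresh = get_client()         # a new instance is created after closing
print(first is same, first.closed, fresh is first)
```

Reusing one client lets requests share a connection pool, which is why closing it explicitly at shutdown matters.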
- async tika.core.close_async_client()[source]#
Closes the memoized client if it exists. Should be called when shutting down the application.
- Return type:
- async tika.core.call_server(verb, server_endpoint, service, data, *, headers, verbose=0, tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), classpath=None, raw_response=False, config_path=None, request_options=None)[source]#
Make an HTTP request to the Tika Server.
- tika.core.check_tika_server(scheme='http', server_host='localhost', port='9998', tika_server_jar=PosixPath('/home/runner/work/tika-python/tika-python/src/tika/jars/tika-server-standard-3.0.0.jar'), classpath=None, config_path=None)[source]#
Check if Tika server is running and start it if necessary.
- Parameters:
scheme (Literal['http', 'https']) – Protocol to use (‘http’ or ‘https’). Defaults to “http”.
server_host (str) – Host where the server should run. Defaults to SERVER_HOST.
port (str) – Port for the server. Defaults to PORT.
tika_server_jar (Path) – Path to Tika server JAR. Defaults to TIKA_SERVER_JAR.
classpath (str | None) – Additional classpath entries. Defaults to None.
config_path (str | None) – Path to Tika configuration file. Defaults to None.
- Returns:
Server endpoint URL (e.g., “http://localhost:9998”)
- Return type:
str
- Raises:
RuntimeError – If server JAR signature doesn’t match or server fails to start.
Note
Only attempts to start server for localhost or 127.0.0.1 addresses. For remote servers, just returns the endpoint URL.
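The endpoint format and the local-only start rule from the note can be illustrated with two small helpers. Both function names are hypothetical, and the exact host comparison used by the library is an assumption based on the note.

```python
LOCAL_HOSTS = ("localhost", "127.0.0.1")  # hosts named in the note above


def build_endpoint(scheme: str = "http", server_host: str = "localhost", port: str = "9998") -> str:
    """Assemble the endpoint URL in the documented 'scheme://host:port' form."""
    return f"{scheme}://{server_host}:{port}"


def should_start_locally(server_host: str) -> bool:
    """Only a local host is a candidate for starting the server process."""
    return server_host in LOCAL_HOSTS


print(build_endpoint())
print(should_start_locally("tika.example.org"))
```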
- tika.core.check_jar_signature(tika_server_jar, jar_path)[source]#
Checks the signature of the JAR.
- Parameters:
tika_server_jar (Path)
jar_path
- Return type:
bool
- Returns:
True if the signature of the JAR matches.
- tika.core.start_server(tika_server_jar, java_path='java', java_args='', server_host='localhost', port='9998', classpath=None, config_path=None)[source]#
Start the Tika Server as a subprocess.
- Parameters:
tika_server_jar (Path) – Path to the Tika server JAR file.
java_path (str) – Path to Java executable. Defaults to TIKA_JAVA.
java_args (str) – Additional Java arguments. Defaults to TIKA_JAVA_ARGS.
server_host (str) – Host interface address for binding. Defaults to SERVER_HOST.
port (str) – Port number for the server. Defaults to PORT.
classpath (str | None) – Additional classpath entries. Defaults to None.
config_path (str | None) – Path to Tika configuration file. Defaults to None.
- Returns:
True if server started successfully, False otherwise.
- Return type:
bool
Note
Creates a log file at TIKA_SERVER_LOG_FILE_PATH/tika-server.log
On Windows, forces server_host to “0.0.0.0”
Attempts to start server multiple times based on TIKA_STARTUP_MAX_RETRY
Sets global TIKA_SERVER_PROCESS variable for later cleanup
- tika.core.kill_server(tika_server_process=None, *, is_windows=False)[source]#
Kill the running Tika server process.
- Parameters:
- Return type:
Note
On Windows, uses SIGTERM signal directly
On Unix-like systems, kills the process group
Waits 1 second after sending kill signal
Logs errors if process cannot be killed
- tika.core.get_remote_file(url_or_path, dest_path)[source]#
Fetch a remote file or handle a local file/binary stream.
- Parameters:
- Returns:
- Tuple containing:
Path object pointing to the local file
String indicating the source type: “local” for local files, “remote” for downloaded files, “binary” for binary streams
- Return type:
tuple[Path, Literal[“local”, “remote”, “binary”]]
- Raises:
Note
For binary stream inputs, a temporary file is created with a timestamp-based name.
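The three source types in the return value can be told apart roughly as below. This is a sketch only: `classify_source` is a hypothetical helper, and treating exactly the http, https, and ftp schemes as remote is an assumption, not something the documentation states.

```python
import io
from pathlib import Path
from typing import BinaryIO, Union
from urllib.parse import urlparse

REMOTE_SCHEMES = ("http", "https", "ftp")  # assumed set of remote schemes


def classify_source(url_or_path: Union[str, Path, BinaryIO]) -> str:
    """Return 'binary', 'remote', or 'local' for the given input."""
    if hasattr(url_or_path, "read"):  # file-like object
        return "binary"
    scheme = urlparse(str(url_or_path)).scheme.lower()
    return "remote" if scheme in REMOTE_SCHEMES else "local"


print(classify_source("https://example.org/report.pdf"))
print(classify_source(Path("report.pdf")))
print(classify_source(io.BytesIO(b"%PDF-1.7")))
```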
- tika.core.check_port_is_open(remote_server_host='localhost', port='9998')[source]#
Check if a specific port is open on the given host.
- Parameters:
- Returns:
True if the port is open and accepting connections, False otherwise.
- Return type:
bool
Note
- This function will exit the program if:
There is a keyboard interrupt
The hostname cannot be resolved
There are connection issues with the server
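A port check of this kind can be sketched with the standard socket module. Unlike the function documented above, this hypothetical `port_is_open` helper returns False on resolution or connection errors instead of exiting the program; the timeout value is also an assumption.

```python
import socket


def port_is_open(host: str = "localhost", port: str = "9998", timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within the timeout."""
    try:
        # The port is a string in the documented signature, so convert it here
        with socket.create_connection((host, int(port)), timeout=timeout):
            return True
    except OSError:  # covers refused connections, timeouts, and DNS failures
        return False


# Demonstrate against a throwaway listener on an ephemeral local port
server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
server.bind(("127.0.0.1", 0))
server.listen(1)
listening_port = str(server.getsockname()[1])

open_while_listening = port_is_open("127.0.0.1", listening_port)
server.close()
open_after_close = port_is_open("127.0.0.1", listening_port)
print(open_while_listening, open_after_close)
```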
tika.detector module
- async tika.detector.from_file(file_obj, *, config_path=None, request_options=None)[source]#
Detects the MIME type of a file using Apache Tika server.
Analyzes the file content to determine its MIME type (media type) using Tika’s detection capabilities. This is more reliable than extension-based detection.
- Parameters:
file_obj (str | Path | BinaryIO) – The file to analyze. Can be a str (a file path or URL), a Path (a pathlib.Path object pointing to the file), or a BinaryIO (a file-like object in binary read mode).
config_path (str | None) – Optional path to a custom Tika configuration file.
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Return type:
- Returns:
The detected MIME type (e.g., ‘application/pdf’, ‘image/jpeg’). Return type matches the server response, which may be str, bytes, or BinaryIO.
- Raises:
TikaError – If the server returns an unsuccessful status code or if type detection fails.
FileNotFoundError – If the specified file does not exist.
Example
>>> from pathlib import Path
>>> mime_type = await from_file(Path("document.pdf"))
>>> print(mime_type)  # Prints 'application/pdf'
>>> mime_type = await from_file("image.jpg")
>>> print(mime_type)  # Prints 'image/jpeg'
- async tika.detector.from_buffer(buf, *, config_path=None, request_options=None)[source]#
Detects the MIME type of content provided in a buffer using Apache Tika server.
Analyzes the buffered content to determine its MIME type (media type) using Tika’s detection capabilities. Useful for content that hasn’t been saved to a file or for streaming data.
- Parameters:
buf (str | bytes | BinaryIO) – The content to analyze. Can be a str (text content), bytes (binary content), or a BinaryIO (file-like object containing binary content).
config_path (str | None) – Optional path to a custom Tika configuration file.
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Return type:
- Returns:
The detected MIME type (e.g., ‘application/pdf’, ‘text/plain’). Return type matches the server response, which may be str, bytes, or BinaryIO.
- Raises:
Example
>>> with open("document.pdf", "rb") as f:
...     mime_type = await from_buffer(f.read())
>>> print(mime_type)  # Prints 'application/pdf'
>>> text_content = "Hello, world!"
>>> mime_type = await from_buffer(text_content)
>>> print(mime_type)  # Prints 'text/plain'
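Because the detector functions are coroutines, several inputs can be awaited concurrently with asyncio.gather. The sketch below shows the calling pattern only: `fake_detect` is a hypothetical stand-in for tika.detector.from_buffer that guesses from a magic prefix and never contacts a server.

```python
import asyncio
from typing import List, Sequence


async def fake_detect(buf: bytes) -> str:
    """Hypothetical stand-in for `tika.detector.from_buffer`; no server involved."""
    await asyncio.sleep(0)  # yield to the event loop, as a network call would
    return "application/pdf" if buf.startswith(b"%PDF") else "text/plain"


async def detect_all(buffers: Sequence[bytes]) -> List[str]:
    """Run one detection per buffer concurrently; results keep input order."""
    return list(await asyncio.gather(*(fake_detect(b) for b in buffers)))


results = asyncio.run(detect_all([b"%PDF-1.7 sample", b"Hello, world!"]))
print(results)
```

asyncio.gather returns results in the order the awaitables were passed, so each result lines up with its input buffer.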
tika.language module
- async tika.language.from_file(file_obj, request_options=None)[source]#
Detects the language of a file using Apache Tika server.
Uses Tika’s language detection capabilities to identify the primary language of text content within a file.
- Parameters:
file_obj (str | Path | BinaryIO) – The file to analyze. Can be a str (a string path to the file), a Path (a pathlib.Path object pointing to the file), or a BinaryIO (a file-like object in binary read mode).
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Return type:
- Returns:
The detected language code (e.g., ‘en’ for English, ‘fr’ for French). Return type matches the server response, which may be str, bytes, or BinaryIO.
- Raises:
TikaError – If the server returns an unsuccessful status code or if language detection fails.
FileNotFoundError – If the specified file does not exist.
Example
>>> from pathlib import Path
>>> language = await from_file(Path("document.txt"))
>>> print(language)  # Prints 'en' for English text
- async tika.language.from_buffer(buf, request_options=None)[source]#
Detects the language of content provided in a buffer using Apache Tika server.
Sends the buffered content directly to Tika’s language detection service to identify the primary language of the text.
- Parameters:
buf (str | bytes | BinaryIO) – The content to analyze. Can be a str (text content as a string), bytes (binary content), or a BinaryIO (file-like object containing content).
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Return type:
- Returns:
The detected language code (e.g., ‘en’ for English, ‘fr’ for French). Return type matches the server response, which may be str, bytes, or BinaryIO.
- Raises:
Example
>>> text = "Bonjour le monde!"
>>> language = await from_buffer(text)
>>> print(language)  # Prints 'fr' for French text
tika.parser module
- async tika.parser.from_file(obj, *, server_endpoint='http://localhost:9998', service='all', xml_content=False, headers=None, config_path=None, request_options=None)[source]#
Parses a file using Apache Tika server and returns structured content and metadata.
This function sends a file to the Tika server for parsing using the specified service and configuration options. It can handle local files, URLs, or binary streams.
- Parameters:
obj (str | Path | BinaryIO) – The file to be parsed. Can be a str (a file path or URL), a Path (a pathlib.Path object pointing to a file), or a BinaryIO (a file-like object in binary read mode).
server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.
service (str) – The Tika service to use. Must be one of: “all” (both content and metadata, the default), “meta” (only metadata), or “text” (only text content).
xml_content (bool) – If True, requests XML output instead of plain text. This affects how the content is structured in the response.
headers (dict[str, Any] | None) – Additional HTTP headers to include in the request.
config_path (str | None) – Path to a custom Tika configuration file.
request_options (dict[str, Any] | None) – Additional options for the HTTP request (e.g., timeout).
- Returns:
- A dictionary-like object containing:
content: Extracted text or XML content (str or None)
metadata: Dictionary of document metadata (dict or None)
status: HTTP status code (int)
attachments: Any embedded files (dict or None)
- Return type:
- Raises:
TikaError – If the server returns an error or parsing fails
FileNotFoundError – If the specified file doesn’t exist
ValueError – If an invalid service type is specified
Example
>>> response = await from_file("document.pdf", service="all")
>>> print(response.content)  # Print extracted text
>>> print(response.metadata.get("Content-Type"))  # Get document type
- async tika.parser.from_buffer(buf, *, server_endpoint='http://localhost:9998', xml_content=False, headers=None, config_path=None, request_options=None)[source]#
Parses content directly from a buffer using Apache Tika server.
This function sends buffered content to the Tika server for parsing and returns structured content and metadata. It automatically uses the /rmeta endpoint for either text or XML output.
- Parameters:
buf (str | bytes | BinaryIO) – The content to parse. Can be a str (text content), bytes (binary content), or a BinaryIO (file-like object with binary content).
server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.
xml_content (bool) – If True, requests XML output instead of plain text. Affects the structure of the returned content.
headers (dict[str, Any] | None) – Additional HTTP headers to include in the request. ‘Accept: application/json’ is automatically added.
config_path (str | None) – Path to a custom Tika configuration file.
request_options (dict[str, Any] | None) – Additional options for the HTTP request (e.g., timeout).
- Returns:
- A dictionary-like object containing:
content: Extracted text or XML content (str or None)
metadata: Dictionary of document metadata (dict or None)
status: HTTP status code (int)
attachments: Any embedded files (dict or None)
- Return type:
- Raises:
Example
>>> with open("document.pdf", "rb") as f:
...     response = await from_buffer(f.read())
>>> print(response.metadata)  # Print all metadata
tika.unpack module
- async tika.unpack.from_file(file_obj, *, server_endpoint='http://localhost:9998', request_options=None)[source]#
Parses a file using Apache Tika server’s unpack endpoint.
This function sends the provided file to a Tika server for parsing and returns the extracted content, metadata, and attachments.
- Parameters:
file_obj (Path) – A Path object pointing to the file to be parsed.
server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Returns:
- A dictionary-like object containing:
content: The extracted text content (str)
metadata: Dictionary of metadata key-value pairs
attachments: Dictionary of embedded files
status: HTTP status code of the response
- Return type:
- Raises:
TikaError – If the server returns an unsuccessful status code or if parsing fails.
FileNotFoundError – If the specified file does not exist.
Example
>>> from pathlib import Path
>>> response = await from_file(Path("document.pdf"))
>>> print(response.content)  # Print extracted text
>>> print(response.metadata)  # Print document metadata
- async tika.unpack.from_buffer(buf, *, server_endpoint='http://localhost:9998', headers=None, request_options=None)[source]#
Parses content directly from a buffer using Apache Tika server’s unpack endpoint.
This function sends buffered content (string, bytes, or file-like object) to a Tika server for parsing and returns the extracted content, metadata, and attachments.
- Parameters:
buf (str | bytes | BinaryIO) – The content to be parsed. Can be a str (text content), bytes (binary content), or a BinaryIO (file-like object containing binary content).
server_endpoint (str) – The URL of the Tika server. Defaults to SERVER_ENDPOINT.
headers (dict[str, Any] | None) – Optional dictionary of additional HTTP headers to send with the request. The ‘Accept: application/x-tar’ header will be added automatically.
request_options (dict[str, Any] | None) – Optional dictionary of request options to pass to the server. Can include parameters like timeout, headers, etc.
- Returns:
- A dictionary-like object containing:
content: The extracted text content (str)
metadata: Dictionary of metadata key-value pairs
attachments: Dictionary of embedded files
status: HTTP status code of the response
- Return type:
- Raises:
Example
>>> with open("document.pdf", "rb") as f:
...     response = await from_buffer(f.read())
>>> print(response.metadata.get("Content-Type"))
Module contents
- tika.start_server(tika_server_jar, java_path='java', java_args='', server_host='localhost', port='9998', classpath=None, config_path=None)[source]#
Start the Tika Server as a subprocess.
- Parameters:
tika_server_jar (Path) – Path to the Tika server JAR file.
java_path (str) – Path to Java executable. Defaults to TIKA_JAVA.
java_args (str) – Additional Java arguments. Defaults to TIKA_JAVA_ARGS.
server_host (str) – Host interface address for binding. Defaults to SERVER_HOST.
port (str) – Port number for the server. Defaults to PORT.
classpath (str | None) – Additional classpath entries. Defaults to None.
config_path (str | None) – Path to Tika configuration file. Defaults to None.
- Returns:
True if server started successfully, False otherwise.
- Return type:
bool
Note
Creates a log file at TIKA_SERVER_LOG_FILE_PATH/tika-server.log
On Windows, forces server_host to “0.0.0.0”
Attempts to start server multiple times based on TIKA_STARTUP_MAX_RETRY
Sets global TIKA_SERVER_PROCESS variable for later cleanup
- tika.kill_server(tika_server_process=None, *, is_windows=False)[source]#
Kill the running Tika server process.
- Parameters:
- Return type:
Note
On Windows, uses SIGTERM signal directly
On Unix-like systems, kills the process group
Waits 1 second after sending kill signal
Logs errors if process cannot be killed