tikara#
Main package entrypoint for Tikara.
Submodules#
Exceptions#
- Base class for all exceptions raised by Tikara.
Classes#
- Tika: The main entrypoint class. Wraps management of the underlying Tika and JVM instances.
- TikaDetectLanguageResult: Represents the result of a language detection operation.
- TikaLanguageConfidence: Enum representing the confidence level of a detected language result.
- TikaMetadata: Normalized metadata from Tika document processing with standardized field names.
- TikaUnpackedItem: Individual unpacked embedded document.
- TikaUnpackResult: Result of unpacking a document with embedded files.
Package Contents#
- class tikara.Tika(*, lazy_load: bool = True, custom_parsers: list[org.apache.tika.parser.Parser] | collections.abc.Callable[[], list[org.apache.tika.parser.Parser]] | None = None, custom_detectors: list[org.apache.tika.detect.Detector] | collections.abc.Callable[[], list[org.apache.tika.detect.Detector]] | None = None, custom_mime_types: list[str] | None = None, extra_jars: list[pathlib.Path] | None = None, tika_jar_override: pathlib.Path | None = None)[source]#
The main entrypoint class. Wraps management of the underlying Tika and JVM instances.
Initialize a new Tika wrapper instance.
This class provides a Python interface to Apache Tika’s content detection, extraction and language detection capabilities. It manages JVM initialization and Tika configuration including custom parsers, detectors and MIME types.
- Parameters:
lazy_load – Whether to load the JVM, Tika classes, and language models lazily on first use. Defaults to True. If False, the JVM is initialized immediately. Lazy loading can improve startup time for short-lived tasks. Note that the JVM is always shut down when the Tika instance is deleted.
custom_parsers – Custom parsers to add to the Tika pipeline. Can be either a list of Parser instances or a callable that returns such a list. Defaults to None.
custom_detectors – Custom detectors to add to the Tika pipeline. Can be either a list of Detector instances or a callable that returns such a list. Defaults to None.
custom_mime_types – Additional MIME types to register with Tika. Must be in format “type/subtype”. Defaults to None. Required when adding custom parsers/detectors that handle new MIME types.
extra_jars – Additional JAR files to add to the JVM classpath. Useful for custom parsers/detectors. Defaults to None.
tika_jar_override – Path to custom Tika JAR file to use instead of bundled version. Defaults to None.
- Raises:
ValueError – If a custom MIME type is malformed (incorrect format).
FileNotFoundError – If specified JAR files don’t exist.
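The ValueError raised for malformed MIME types follows the documented "type/subtype" format. The stand-in validator below is a minimal sketch of that rule for illustration only; it is not Tikara's actual implementation, and the name check_mime_type is hypothetical:

```python
import re

# Illustrative check for the documented "type/subtype" format.
# Tikara's real validation may differ; this is a stand-in.
MIME_PATTERN = re.compile(r"^[\w.+-]+/[\w.+-]+$")

def check_mime_type(mime: str) -> str:
    """Return the MIME type unchanged, or raise ValueError if malformed."""
    if not MIME_PATTERN.match(mime):
        raise ValueError(f"Malformed MIME type: {mime!r} (expected 'type/subtype')")
    return mime
```

Under this sketch, "text/markdown" passes while a bare "markdown" raises ValueError, mirroring the behavior described above.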
Examples
Basic usage:
>>> from tikara import Tika
>>> tika = Tika()
>>> mime_type = tika.detect_mime_type("document.pdf")
>>> content, metadata = tika.parse("document.pdf")
With custom parser:
>>> from custom_parser import MarkdownParser
>>> tika = Tika(
...     custom_parsers=[MarkdownParser()],
...     custom_mime_types=["text/markdown"],
... )
With custom detector:
>>> from custom_detector import MarkdownDetector
>>> tika = Tika(
...     custom_detectors=[MarkdownDetector()],
...     custom_mime_types=["text/markdown"],
... )
Notes
- Custom parsers and detectors must implement the respective Java interfaces from Apache Tika. See examples/custom_parser.ipynb and examples/custom_detector.ipynb for implementation details.
- The JVM is initialized on first instantiation. Subsequent instances reuse the same JVM.
- Custom MIME types must be registered when adding custom parsers/detectors for new formats.
- Language detection models are loaded lazily by default to improve startup time.
See also
examples/parsing.ipynb: Examples of content extraction
examples/detect_mime_type.ipynb: Examples of MIME type detection
examples/detect_language.ipynb: Examples of language detection
examples/custom_parser.ipynb: Custom parser implementation
examples/custom_detector.ipynb: Custom detector implementation
- detect_mime_type(obj: tikara.data_types.TikaInputType) str [source]#
Detect the MIME type of a file, bytes, or stream.
Uses Apache Tika’s MIME type detection capabilities which combine file extension examination, magic bytes analysis, and content inspection. For best results when using streams/bytes, provide content from the beginning of the file since magic byte detection examines file headers.
- Parameters:
obj – Input to detect the MIME type of. Can be:
  - Path or str: filesystem path
  - bytes: raw content bytes
  - BinaryIO: file-like object in binary mode
- Returns:
Detected MIME type in format “type/subtype” (e.g. “application/pdf”)
- Return type:
str
- Raises:
TypeError – If input type is not supported
FileNotFoundError – If input file path does not exist
ValueError – If detection fails
Examples
Path input:
>>> tika = Tika()
>>> tika.detect_mime_type("document.pdf")
Bytes input:
>>> with open("document.pdf", "rb") as f:
...     tika.detect_mime_type(f.read())
Stream input:
>>> from io import BytesIO
>>> bio = BytesIO(b"<html><body>Hello</body></html>")
>>> tika.detect_mime_type(bio)
Notes
- Supports all of the more than 1600 MIME types recognized by Apache Tika
- Custom MIME types can be added via custom detectors
- For reliable detection, provide at least 1KB of content when using bytes/streams
- Detection order: custom detectors -> default Tika detectors
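Since magic-byte detection examines file headers, a caller streaming a large file need only pass the leading chunk. The helper below is a hypothetical sketch of that pattern (the name head_bytes and the 1 KB figure from the note above are illustrative, not part of Tikara's API):

```python
from io import BytesIO
from typing import BinaryIO

def head_bytes(stream: BinaryIO, size: int = 1024) -> bytes:
    """Read only the leading bytes of a stream, enough for magic-byte detection."""
    return stream.read(size)

# BytesIO stands in for a real file stream; the leading bytes could then
# be passed to Tika().detect_mime_type(...).
sample = head_bytes(BytesIO(b"%PDF-1.7 " + b"\x00" * 4096))
```

This avoids loading a multi-gigabyte file into memory just to identify its type.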
See also
examples/detect_mime_type.ipynb: More detection examples
examples/custom_detector.ipynb: Adding custom MIME type detection
- detect_language(content: str) tikara.data_types.TikaDetectLanguageResult [source]#
Detect the natural language of text content using Apache Tika’s language detection.
Uses statistical language detection models to identify the most likely language. Higher confidence and raw scores indicate more reliable detection. For best results, provide at least 50 characters of text.
- Parameters:
content – Text content to analyze. Should be plain text, not markup/code.
- Returns:
  TikaDetectLanguageResult with fields:
    - language: ISO 639-1 language code (e.g. "en" for English)
    - confidence: qualitative confidence level (HIGH/MEDIUM/LOW/NONE)
    - raw_score: numeric confidence score between 0 and 1
- Return type:
  TikaDetectLanguageResult
- Raises:
ValueError – If content is empty
RuntimeError – If language detection fails
Examples
High confidence detection:
>>> tika = Tika()
>>> result = tika.detect_language("The quick brown fox jumps over the lazy dog")
>>> result.language
'en'
>>> result.confidence
TikaLanguageConfidence.HIGH
>>> result.raw_score
0.999
Lower confidence example:
>>> result = tika.detect_language("123")
>>> result.confidence
TikaLanguageConfidence.LOW
Other languages:
tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso").language 'es'
Notes
Models are loaded lazily on first use unless lazy_load=False in constructor
Supports ~70 languages including all major European and Asian languages
Short or ambiguous content may result in lower confidence scores
Language models are memory-intensive; loaded models persist until JVM shutdown
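Because short or ambiguous input lowers confidence, callers often gate on the qualitative level rather than the raw score. The sketch below uses a stand-in enum mirroring the documented HIGH/MEDIUM/LOW/NONE levels (the real enum is tikara.TikaLanguageConfidence, and the accept_detection helper is hypothetical):

```python
from enum import Enum
from typing import Optional

# Stand-in for tikara.TikaLanguageConfidence, mirroring the documented levels.
class Confidence(str, Enum):
    HIGH = "HIGH"
    MEDIUM = "MEDIUM"
    LOW = "LOW"
    NONE = "NONE"

def accept_detection(language: str, confidence: Confidence) -> Optional[str]:
    """Keep a detected language only when confidence is HIGH or MEDIUM."""
    if confidence in (Confidence.HIGH, Confidence.MEDIUM):
        return language
    return None
```

With real results, the same gate would read `accept_detection(result.language, result.confidence)`.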
See also
examples/detect_language.ipynb: Additional language detection examples
- unpack(obj: tikara.data_types.TikaInputType, output_dir: pathlib.Path, *, max_depth: int = 1, input_file_name: str | pathlib.Path | None = None, content_type: str | None = None) tikara.data_types.TikaUnpackResult [source]#
Extract embedded documents from a container document recursively.
Extracts and saves embedded documents (e.g. images in PDFs, files in Office documents) to disk. Can recursively extract from nested containers up to specified depth.
- Parameters:
obj – Input container document to extract from. Can be:
  - Path or str: filesystem path
  - bytes: raw content bytes
  - BinaryIO: file-like object in binary mode
output_dir – Directory to save extracted documents to. Created if doesn’t exist.
max_depth – Maximum recursion depth for nested containers. Default 1 extracts only top-level embedded docs.
input_file_name – Original filename when obj is bytes or a stream. Improves metadata extraction and is used to name the root file written to output_dir. Only needed if obj is bytes or a stream.
content_type – MIME type of input if known. Helps with metadata extraction.
- Returns:
  TikaUnpackResult with fields:
    - root_metadata: metadata of the root document
    - embedded_documents: list of TikaUnpackedItem objects representing extracted files
- Return type:
  TikaUnpackResult
- Raises:
FileNotFoundError – If input file path doesn’t exist
ValueError – If input type not supported
RuntimeError – If extraction fails
Examples
>>> from pathlib import Path
>>> tika = Tika()
>>> result = tika.unpack("presentation.pptx", Path("extracted/"))
>>> for item in result.embedded_documents:
...     print(f"Found {item.metadata['Content-Type']} at {item.file_path}")
Found image/png at extracted/image1.png
Found application/pdf at extracted/embedded.pdf
Notes
Creates output_dir if it doesn’t exist
Handles nested containers (ZIP, PDF, Office docs etc)
Extracts images, attachments, embedded files
Returns paths are relative to output_dir
Metadata includes content type, relations, properties
Extraction depth measured from input document
For streams/bytes, provide filename/type if possible
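Since returned paths are relative to output_dir, resolving them against that directory yields absolute locations on disk. A minimal sketch, using a hypothetical relative path rather than actual unpack() output:

```python
from pathlib import Path

# Resolve an unpacked item's relative path against the output directory.
output_dir = Path("extracted")
relative = Path("image1.png")  # e.g. what a TikaUnpackedItem.file_path might hold
absolute = (output_dir / relative).resolve()
```

The same join works for nested items, since pathlib preserves any subdirectory components in the relative path.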
See also
examples/unpack.ipynb: Additional extraction examples
RecursiveEmbeddedDocumentExtractor: Core extraction logic
- parse(obj: tikara.data_types.TikaInputType, *, output_format: tikara.data_types.TikaParseOutputFormat = 'xhtml', input_file_name: str | pathlib.Path | None = None, content_type: str | None = None) tuple[str, tikara.data_types.TikaMetadata] [source]#
- parse(obj: tikara.data_types.TikaInputType, *, output_file: pathlib.Path | str, output_format: tikara.data_types.TikaParseOutputFormat = 'xhtml', input_file_name: str | pathlib.Path | None = None, content_type: str | None = None) tuple[pathlib.Path, tikara.data_types.TikaMetadata]
- parse(obj: tikara.data_types.TikaInputType, *, output_stream: bool, output_format: tikara.data_types.TikaParseOutputFormat = 'xhtml', input_file_name: str | pathlib.Path | None = None, content_type: str | None = None) tuple[BinaryIO, tikara.data_types.TikaMetadata]
Extract text content and metadata from documents.
Uses Apache Tika’s parsing capabilities to extract plain text or structured content from documents, along with metadata. Supports multiple input and output formats.
- Parameters:
obj – Input to parse. Can be:
  - Path or str: filesystem path
  - bytes: raw content bytes
  - BinaryIO: file-like object in binary mode
output_stream – Whether to return content as a stream instead of string
output_format – Format for extracted text:
  - "txt": plain text without markup
  - "xhtml": structured XML with text formatting (default)
output_file – Save content to this path instead of returning it
input_file_name – Original filename if obj is bytes/stream
content_type – MIME type of input if known
- Returns:
  Tuple containing:
    - Content (type depends on output mode):
      - str if neither output_file nor output_stream is given
      - Path if output_file is specified
      - BinaryIO if output_stream=True
    - TikaMetadata describing the document
- Return type:
  tuple
- Raises:
ValueError – If output_file needed but not provided
FileNotFoundError – If input file doesn’t exist
TypeError – If input type not supported
Examples
Basic text extraction:
>>> tika = Tika()
>>> content, meta = tika.parse("report.pdf")
>>> print(f"Title: {meta.get('title')}")
>>> print(content[:100])  # First 100 chars
Stream output:
>>> content, meta = tika.parse(
...     "large.pdf",
...     output_stream=True,
...     output_format="txt",
... )
>>> for line in content:
...     process(line)
Save to file:
>>> path, meta = tika.parse(
...     "input.docx",
...     output_file="extracted.txt",
...     output_format="txt",
... )
Parse bytes with hints:
>>> with open("doc.pdf", "rb") as f:
...     content, meta = tika.parse(
...         f.read(),
...         input_file_name="doc.pdf",
...         content_type="application/pdf",
...     )
Notes
“xhtml” format preserves document structure
“txt” format gives clean plain text
Handles >1600 file formats
More accurate with filename/type hints
Streams good for large files
Metadata includes standard Dublin Core fields
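The streaming note above amounts to iterating the returned binary stream line by line instead of holding the whole document in memory. A minimal sketch of that pattern, with BytesIO standing in for the stream that parse(..., output_stream=True) would return:

```python
from io import BytesIO

# Stand-in for the BinaryIO returned by parse(..., output_stream=True).
stream = BytesIO(b"first line\nsecond line\n")

# Iterating a binary stream yields one line of bytes at a time; decode
# and strip the trailing newline before further processing.
lines = [line.decode("utf-8").rstrip("\n") for line in stream]
```

For genuinely large outputs, replace the list comprehension with a plain loop so only one line is held at a time.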
See also
examples/parsing.ipynb: More parsing examples
- class tikara.TikaDetectLanguageResult(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Represents the result of a language detection operation.
Create a new model by parsing and validating input data from keyword arguments. Raises pydantic.ValidationError if the input data cannot be validated into a valid model.
- class tikara.TikaLanguageConfidence[source]#
Bases:
enum.StrEnum
Enum representing the confidence level of a detected language result.
- class tikara.TikaMetadata(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Normalized metadata from Tika document processing with standardized field names.
- class tikara.TikaUnpackedItem(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Individual unpacked embedded document.
- class tikara.TikaUnpackResult(/, **data: Any)[source]#
Bases:
pydantic.BaseModel
Result of unpacking a document with embedded files.