tikara package#

Submodules#

tikara.core module#

Contains the core Tika entrypoint. It is re-exported from tikara, so nothing needs to be imported from this module directly.

class tikara.core.Tika(*, lazy_load: bool = True, custom_parsers: list[Parser] | Callable[[], list[Parser]] | None = None, custom_detectors: list[Detector] | Callable[[], list[Detector]] | None = None, custom_mime_types: list[str] | None = None, extra_jars: list[Path] | None = None, tika_jar_override: Path | None = None)[source]#

Bases: object

The main entrypoint class. Wraps management of the underlying Tika and JVM instances.
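
A minimal construction sketch; the jar path below is a hypothetical placeholder, not part of the library:

>>> from pathlib import Path
>>> from tikara import Tika
>>> tika = Tika()  # underlying Tika/JVM resources initialize lazily on first use
>>> eager = Tika(
...     lazy_load=False,  # initialize the JVM and models immediately instead
...     extra_jars=[Path("/opt/jars/my-format-support.jar")],  # hypothetical extra jar
... )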

detect_language(content: str) TikaDetectLanguageResult[source]#

Detect the natural language of text content using Apache Tika’s language detection.

Uses statistical language detection models to identify the most likely language. Higher confidence and raw scores indicate more reliable detection. For best results, provide at least 50 characters of text.

Parameters:

content – Text content to analyze. Should be plain text, not markup/code.

Returns:

  • language: ISO 639-1 language code (e.g. “en” for English)

  • confidence: Qualitative confidence level (HIGH/MEDIUM/LOW/NONE)

  • raw_score: Numeric confidence score between 0 and 1

Return type:

TikaDetectLanguageResult

Raises:
  • ValueError – If content is empty

  • RuntimeError – If language detection fails

Examples

High confidence detection:

>>> tika = Tika()
>>> result = tika.detect_language("The quick brown fox jumps over the lazy dog")
>>> result.language
'en'
>>> result.confidence
TikaLanguageConfidence.HIGH
>>> result.raw_score
0.999

Lower confidence example:

>>> result = tika.detect_language("123")
>>> result.confidence
TikaLanguageConfidence.LOW

Other languages:

>>> tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso").language
'es'

Notes

  • Models are loaded lazily on first use unless lazy_load=False in constructor

  • Supports ~70 languages including all major European and Asian languages

  • Short or ambiguous content may result in lower confidence scores

  • Language models are memory-intensive; loaded models persist until JVM shutdown

See also

  • examples/detect_language.ipynb: Additional language detection examples

detect_mime_type(obj: str | Path | bytes | BinaryIO) str[source]#

Detect the MIME type of a file, bytes, or stream.

Uses Apache Tika’s MIME type detection capabilities which combine file extension examination, magic bytes analysis, and content inspection. For best results when using streams/bytes, provide content from the beginning of the file since magic byte detection examines file headers.

Parameters:

obj – Input to detect MIME type for. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

Returns:

Detected MIME type in format “type/subtype” (e.g. “application/pdf”)

Return type:

str

Raises:
  • TypeError – If input type is not supported

  • FileNotFoundError – If input file path does not exist

  • ValueError – If detection fails

Examples

Path input:

>>> tika = Tika()
>>> tika.detect_mime_type("document.pdf")

Bytes input:

>>> with open("document.pdf", "rb") as f:
...     tika.detect_mime_type(f.read())

Stream input:

>>> from io import BytesIO
>>> bio = BytesIO(b"<html><body>Hello</body></html>")
>>> tika.detect_mime_type(bio)

Notes

  • Supports the more than 1600 MIME types recognized by Apache Tika

  • Custom MIME types can be added via custom detectors

  • For reliable detection, provide at least 1KB of content when using bytes/streams

  • Detection order: custom detectors -> default Tika detectors

See also

  • examples/detect_mime_type.ipynb: More detection examples

  • examples/custom_detector.ipynb: Adding custom MIME type detection

parse(obj: TikaInputType, *, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[str, TikaMetadata][source]#
parse(obj: TikaInputType, *, output_file: Path | str, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[Path, TikaMetadata]
parse(obj: TikaInputType, *, output_stream: bool, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[BinaryIO, TikaMetadata]

Extract text content and metadata from documents.

Uses Apache Tika’s parsing capabilities to extract plain text or structured content from documents, along with metadata. Supports multiple input and output formats.

Parameters:
  • obj – Input to parse. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

  • output_stream – Whether to return content as a stream instead of string

  • output_format – Format for extracted text: “txt” for plain text without markup, or “xhtml” (default) for structured XHTML that preserves text formatting.

  • output_file – Save content to this path instead of returning it

  • input_file_name – Original filename if obj is bytes/stream

  • content_type – MIME type of input if known

Returns:

  • Content (type depends on output mode):
    • String if no output_file/output_stream

    • Path if output_file specified

    • BinaryIO if output_stream=True

  • TikaMetadata object describing the document

Return type:

Tuple of (content, metadata)

Raises:
  • ValueError – If output_file needed but not provided

  • FileNotFoundError – If input file doesn’t exist

  • TypeError – If input type not supported

Examples

Basic text extraction:

>>> tika = Tika()
>>> content, meta = tika.parse("report.pdf")
>>> print(f"Title: {meta.title}")
>>> print(content[:100])  # First 100 chars

Stream output:

>>> content, meta = tika.parse(
...     "large.pdf",
...     output_stream=True,
...     output_format="txt"
... )
>>> for line in content:
...     process(line)  # user-supplied handler

Save to file:

>>> path, meta = tika.parse(
...     "input.docx",
...     output_file="extracted.txt",
...     output_format="txt"
... )

Parse bytes with hints:

>>> with open("doc.pdf", "rb") as f:
...     content, meta = tika.parse(
...         f.read(),
...         input_file_name="doc.pdf",
...         content_type="application/pdf"
...     )

Notes

  • “xhtml” format preserves document structure

  • “txt” format gives clean plain text

  • Handles more than 1600 file formats

  • Parsing is more accurate when filename and content-type hints are provided

  • Stream output is recommended for large files

  • Metadata includes standard Dublin Core fields

See also

  • examples/parsing.ipynb: More parsing examples

unpack(obj: str | Path | bytes | BinaryIO, output_dir: Path, *, max_depth: int = 1, input_file_name: str | Path | None = None, content_type: str | None = None) TikaUnpackResult[source]#

Extract embedded documents from a container document recursively.

Extracts and saves embedded documents (e.g. images in PDFs, files in Office documents) to disk. Can recursively extract from nested containers up to specified depth.

Parameters:
  • obj – Input container document to extract from. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

  • output_dir – Directory to save extracted documents to. Created if it doesn’t exist.

  • max_depth – Maximum recursion depth for nested containers. Default 1 extracts only top-level embedded docs.

  • input_file_name – Original filename if obj is bytes/stream. Helps with metadata extraction and with naming the root file in output_dir. Only needed when obj is bytes or a stream.

  • content_type – MIME type of input if known. Helps with metadata extraction.

Returns:

  • root_metadata: Metadata of the root document

  • embedded_documents: List of TikaUnpackedItem objects representing the extracted files

Return type:

TikaUnpackResult

Raises:
  • FileNotFoundError – If input file path doesn’t exist

  • ValueError – If input type not supported

  • RuntimeError – If extraction fails

Examples

>>> from pathlib import Path
>>> tika = Tika()
>>> result = tika.unpack("presentation.pptx", Path("extracted/"))
>>> for item in result.embedded_documents:
...     print(f"Found {item.metadata.content_type} at {item.file_path}")
Found image/png at extracted/image1.png
Found application/pdf at extracted/embedded.pdf

Notes

  • Creates output_dir if it doesn’t exist

  • Handles nested containers (ZIP, PDF, Office docs etc)

  • Extracts images, attachments, embedded files

  • Returned paths are relative to output_dir

  • Metadata includes content type, relations, and properties

  • Extraction depth is measured from the input document (see the sketch below)

  • For streams/bytes, provide filename/type if possible
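
A short sketch of the depth behavior noted above; “mail.eml” is a hypothetical input file:

>>> from pathlib import Path
>>> from tikara import Tika
>>> tika = Tika()
>>> # max_depth=2 also unpacks containers nested one level inside the input,
>>> # e.g. an image inside a document that is itself an email attachment
>>> result = tika.unpack("mail.eml", Path("unpacked/"), max_depth=2)
>>> for item in result.embedded_documents:
...     print(item.file_path, item.metadata.content_type)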

See also

  • examples/unpack.ipynb: Additional extraction examples

  • RecursiveEmbeddedDocumentExtractor: Core extraction logic

tikara.data_types module#

Common data types used in public methods and classes.

class tikara.data_types.TikaDetectLanguageResult(*, language: str, confidence: TikaLanguageConfidence, raw_score: float)[source]#

Bases: BaseModel

Represents the result of a language detection operation.

confidence: TikaLanguageConfidence[source]#
language: str[source]#
model_config: ClassVar[ConfigDict] = {}[source]#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

raw_score: float[source]#
class tikara.data_types.TikaLanguageConfidence(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: StrEnum

Enum representing the confidence level of a detected language result.

HIGH = 'HIGH'[source]#
LOW = 'LOW'[source]#
MEDIUM = 'MEDIUM'[source]#
NONE = 'NONE'[source]#
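
Since the enum derives from StrEnum, members are also plain strings; a minimal comparison sketch:

>>> from tikara import Tika, TikaLanguageConfidence
>>> tika = Tika()
>>> result = tika.detect_language("123")
>>> if result.confidence is TikaLanguageConfidence.HIGH:
...     print("reliable detection")
... elif result.confidence == "LOW":  # StrEnum members compare equal to strings
...     print("consider providing more text")
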
class tikara.data_types.TikaMetadata(*, encoding: str | None = None, compression: str | None = None, paragraph_count: int | None = None, revision: str | None = None, word_count: int | None = None, line_count: int | None = None, character_count: int | None = None, character_count_with_spaces: int | None = None, page_count: int | None = None, chars_per_page: list[int] | int | None = None, table_count: int | str | None = None, component_count: int | None = None, image_count: int | None = None, hidden_slides: str | None = None, resource_name: str | None = None, resource_path: str | None = None, embedded_resource_type: str | None = None, embedded_relationship_id: str | None = None, embedded_depth: int | None = None, created: str | None = None, modified: str | None = None, content_type: str | None = None, content_type_override: str | None = None, content_length: int | None = None, title: str | None = None, description: str | None = None, type: str | None = None, keywords: str | list[str] | None = None, company: str | None = None, creator: str | None = None, publisher: str | None = None, contributor: str | None = None, language: str | None = None, identifier: str | None = None, application: str | None = None, application_version: str | None = None, producer: str | None = None, version: str | None = None, template: str | None = None, security: str | None = None, is_encrypted: bool | str | None = None, height: int | str | None = None, width: int | str | None = None, duration: float | str | None = None, sample_rate: int | str | None = None, stream_count: int | str | None = None, image_pixel_aspect_ratio: float | str | None = None, image_color_space: str | None = None, audio_channels: int | str | None = None, audio_bits: int | str | None = None, audio_sample_type: str | None = None, audio_encoding: str | None = None, video_frame_rate: float | str | None = None, video_codec: str | None = None, video_frame_count: int | str | None = None, from_: str | None = None, to: str | None = None, cc: str | None = None, bcc: str | None = None, multipart_subtypes: str | None = None, multipart_boundary: str | None = None, raw_metadata: dict[str, ~typing.Any] = <factory>)[source]#

Bases: BaseModel

Normalized metadata from Tika document processing with standardized field names.
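
A minimal access sketch, assuming a local report.pdf:

>>> from tikara import Tika
>>> tika = Tika()
>>> _, meta = tika.parse("report.pdf")
>>> meta.title, meta.page_count, meta.content_type  # normalized, typed fields
>>> meta.raw_metadata  # every raw key/value pair Tika reported is preserved here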

application: str | None[source]#
application_version: str | None[source]#
audio_bits: int | str | None[source]#
audio_channels: int | str | None[source]#
audio_encoding: str | None[source]#
audio_sample_type: str | None[source]#
bcc: str | None[source]#
cc: str | None[source]#
character_count: int | None[source]#
character_count_with_spaces: int | None[source]#
chars_per_page: list[int] | int | None[source]#
company: str | None[source]#
component_count: int | None[source]#
compression: str | None[source]#
content_length: int | None[source]#
content_type: str | None[source]#
content_type_override: str | None[source]#
contributor: str | None[source]#
created: str | None[source]#
creator: str | None[source]#
description: str | None[source]#
duration: float | str | None[source]#
embedded_depth: int | None[source]#
embedded_relationship_id: str | None[source]#
embedded_resource_type: str | None[source]#
encoding: str | None[source]#
from_: str | None[source]#
height: int | str | None[source]#
hidden_slides: str | None[source]#
identifier: str | None[source]#
image_color_space: str | None[source]#
image_count: int | None[source]#
image_pixel_aspect_ratio: float | str | None[source]#
is_encrypted: bool | str | None[source]#
keywords: str | list[str] | None[source]#
language: str | None[source]#
line_count: int | None[source]#
model_config: ClassVar[ConfigDict] = {}[source]#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

modified: str | None[source]#
multipart_boundary: str | None[source]#
multipart_subtypes: str | None[source]#
page_count: int | None[source]#
paragraph_count: int | None[source]#
producer: str | None[source]#
publisher: str | None[source]#
raw_metadata: dict[str, Any][source]#
resource_name: str | None[source]#
resource_path: str | None[source]#
revision: str | None[source]#
sample_rate: int | str | None[source]#
security: str | None[source]#
stream_count: int | str | None[source]#
table_count: int | str | None[source]#
template: str | None[source]#
title: str | None[source]#
to: str | None[source]#
type: str | None[source]#
version: str | None[source]#
video_codec: str | None[source]#
video_frame_count: int | str | None[source]#
video_frame_rate: float | str | None[source]#
width: int | str | None[source]#
word_count: int | None[source]#
class tikara.data_types.TikaUnpackResult(*, root_metadata: ~tikara.data_types.TikaMetadata, embedded_documents: list[~tikara.data_types.TikaUnpackedItem] = <factory>)[source]#

Bases: BaseModel

Result of unpacking a document with embedded files.

embedded_documents: list[TikaUnpackedItem][source]#
model_config: ClassVar[ConfigDict] = {}[source]#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

root_metadata: TikaMetadata[source]#
class tikara.data_types.TikaUnpackedItem(*, metadata: TikaMetadata, file_path: Path)[source]#

Bases: BaseModel

Individual unpacked embedded document.

file_path: Path[source]#
metadata: TikaMetadata[source]#
model_config: ClassVar[ConfigDict] = {}[source]#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

tikara.error_handling module#

Collection of custom exceptions for Tikara and error handling utils.

exception tikara.error_handling.TikaError[source]#

Bases: Exception

Base class for all exceptions raised by Tikara.

exception tikara.error_handling.TikaInitializationError[source]#

Bases: TikaError

Raised when the Tika server fails to initialize.

exception tikara.error_handling.TikaInputArgumentsError[source]#

Bases: TikaError

Raised when the input parameters to a method are invalid.

exception tikara.error_handling.TikaInputFileNotFoundError[source]#

Bases: TikaInputArgumentsError

Raised when the input file or directory is not found.

exception tikara.error_handling.TikaInputTypeError[source]#

Bases: TikaInputArgumentsError

Raised when the input obj type is invalid.

exception tikara.error_handling.TikaMimeTypeError[source]#

Bases: TikaError

Raised when the mimetype is invalid.

exception tikara.error_handling.TikaOutputFormatError[source]#

Bases: TikaInputArgumentsError

Raised when the output format is invalid.

exception tikara.error_handling.TikaOutputModeError[source]#

Bases: TikaInputArgumentsError

Raised when the output mode is invalid.
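
All of these derive from TikaError, so callers can catch narrowly or broadly. A minimal sketch, assuming parse raises TikaInputFileNotFoundError for a missing input path:

>>> from tikara import Tika
>>> from tikara.error_handling import TikaError, TikaInputFileNotFoundError
>>> tika = Tika()
>>> try:
...     content, meta = tika.parse("maybe-missing.pdf")
... except TikaInputFileNotFoundError:
...     print("input file not found")
... except TikaError as err:
...     print(f"Tika operation failed: {err}")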

tikara.error_handling.wrap_exceptions(func: Callable[[P], R]) Callable[[P], R][source]#

Wrap a function to convert Java Tika exceptions to Python TikaError.

Parameters:

func – The function to wrap

Returns:

Wrapped function that converts Java exceptions to Python exceptions

Raises:

TikaError – When a TikaException occurs
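
A usage sketch for extension code; the decorated function below is hypothetical:

>>> from tikara.error_handling import wrap_exceptions
>>> @wrap_exceptions
... def run_custom_java_call() -> None:  # hypothetical helper that calls into Tika's JVM
...     ...  # any Java TikaException raised inside surfaces as a Python TikaError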

Module contents#

Main package entrypoint for Tikara.

class tikara.Tika(*, lazy_load: bool = True, custom_parsers: list[Parser] | Callable[[], list[Parser]] | None = None, custom_detectors: list[Detector] | Callable[[], list[Detector]] | None = None, custom_mime_types: list[str] | None = None, extra_jars: list[Path] | None = None, tika_jar_override: Path | None = None)[source]#

Bases: object

The main entrypoint class. Wraps management of the underlying Tika and JVM instances.

detect_language(content: str) TikaDetectLanguageResult[source]#

Detect the natural language of text content using Apache Tika’s language detection.

Uses statistical language detection models to identify the most likely language. Higher confidence and raw scores indicate more reliable detection. For best results, provide at least 50 characters of text.

Parameters:

content – Text content to analyze. Should be plain text, not markup/code.

Returns:

  • language: ISO 639-1 language code (e.g. “en” for English)

  • confidence: Qualitative confidence level (HIGH/MEDIUM/LOW/NONE)

  • raw_score: Numeric confidence score between 0 and 1

Return type:

TikaDetectLanguageResult

Raises:
  • ValueError – If content is empty

  • RuntimeError – If language detection fails

Examples

High confidence detection:

>>> tika = Tika()
>>> result = tika.detect_language("The quick brown fox jumps over the lazy dog")
>>> result.language
'en'
>>> result.confidence
TikaLanguageConfidence.HIGH
>>> result.raw_score
0.999

Lower confidence example:

>>> result = tika.detect_language("123")
>>> result.confidence
TikaLanguageConfidence.LOW

Other languages:

>>> tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso").language
'es'

Notes

  • Models are loaded lazily on first use unless lazy_load=False in constructor

  • Supports ~70 languages including all major European and Asian languages

  • Short or ambiguous content may result in lower confidence scores

  • Language models are memory-intensive; loaded models persist until JVM shutdown

See also

  • examples/detect_language.ipynb: Additional language detection examples

detect_mime_type(obj: str | Path | bytes | BinaryIO) str[source]#

Detect the MIME type of a file, bytes, or stream.

Uses Apache Tika’s MIME type detection capabilities which combine file extension examination, magic bytes analysis, and content inspection. For best results when using streams/bytes, provide content from the beginning of the file since magic byte detection examines file headers.

Parameters:

obj – Input to detect MIME type for. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

Returns:

Detected MIME type in format “type/subtype” (e.g. “application/pdf”)

Return type:

str

Raises:
  • TypeError – If input type is not supported

  • FileNotFoundError – If input file path does not exist

  • ValueError – If detection fails

Examples

Path input:

>>> tika = Tika()
>>> tika.detect_mime_type("document.pdf")

Bytes input:

>>> with open("document.pdf", "rb") as f:
...     tika.detect_mime_type(f.read())

Stream input:

>>> from io import BytesIO
>>> bio = BytesIO(b"<html><body>Hello</body></html>")
>>> tika.detect_mime_type(bio)

Notes

  • Supports the more than 1600 MIME types recognized by Apache Tika

  • Custom MIME types can be added via custom detectors

  • For reliable detection, provide at least 1KB of content when using bytes/streams

  • Detection order: custom detectors -> default Tika detectors

See also

  • examples/detect_mime_type.ipynb: More detection examples

  • examples/custom_detector.ipynb: Adding custom MIME type detection

parse(obj: TikaInputType, *, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[str, TikaMetadata][source]#
parse(obj: TikaInputType, *, output_file: Path | str, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[Path, TikaMetadata]
parse(obj: TikaInputType, *, output_stream: bool, output_format: TikaParseOutputFormat = 'xhtml', input_file_name: str | Path | None = None, content_type: str | None = None) tuple[BinaryIO, TikaMetadata]

Extract text content and metadata from documents.

Uses Apache Tika’s parsing capabilities to extract plain text or structured content from documents, along with metadata. Supports multiple input and output formats.

Parameters:
  • obj – Input to parse. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

  • output_stream – Whether to return content as a stream instead of string

  • output_format – Format for extracted text: “txt” for plain text without markup, or “xhtml” (default) for structured XHTML that preserves text formatting.

  • output_file – Save content to this path instead of returning it

  • input_file_name – Original filename if obj is bytes/stream

  • content_type – MIME type of input if known

Returns:

  • Content (type depends on output mode):
    • String if no output_file/output_stream

    • Path if output_file specified

    • BinaryIO if output_stream=True

  • TikaMetadata object describing the document

Return type:

Tuple of (content, metadata)

Raises:
  • ValueError – If output_file needed but not provided

  • FileNotFoundError – If input file doesn’t exist

  • TypeError – If input type not supported

Examples

Basic text extraction:

>>> tika = Tika()
>>> content, meta = tika.parse("report.pdf")
>>> print(f"Title: {meta.title}")
>>> print(content[:100])  # First 100 chars

Stream output:

>>> content, meta = tika.parse(
...     "large.pdf",
...     output_stream=True,
...     output_format="txt"
... )
>>> for line in content:
...     process(line)  # user-supplied handler

Save to file:

>>> path, meta = tika.parse(
...     "input.docx",
...     output_file="extracted.txt",
...     output_format="txt"
... )

Parse bytes with hints:

>>> with open("doc.pdf", "rb") as f:
...     content, meta = tika.parse(
...         f.read(),
...         input_file_name="doc.pdf",
...         content_type="application/pdf"
...     )

Notes

  • “xhtml” format preserves document structure

  • “txt” format gives clean plain text

  • Handles more than 1600 file formats

  • Parsing is more accurate when filename and content-type hints are provided

  • Stream output is recommended for large files

  • Metadata includes standard Dublin Core fields

See also

  • examples/parsing.ipynb: More parsing examples

unpack(obj: str | Path | bytes | BinaryIO, output_dir: Path, *, max_depth: int = 1, input_file_name: str | Path | None = None, content_type: str | None = None) TikaUnpackResult[source]#

Extract embedded documents from a container document recursively.

Extracts and saves embedded documents (e.g. images in PDFs, files in Office documents) to disk. Can recursively extract from nested containers up to specified depth.

Parameters:
  • obj – Input container document to extract from. Can be a Path or str (filesystem path), raw bytes, or a BinaryIO file-like object opened in binary mode.

  • output_dir – Directory to save extracted documents to. Created if it doesn’t exist.

  • max_depth – Maximum recursion depth for nested containers. Default 1 extracts only top-level embedded docs.

  • input_file_name – Original filename if obj is bytes/stream. Helps with metadata extraction and with naming the root file in output_dir. Only needed when obj is bytes or a stream.

  • content_type – MIME type of input if known. Helps with metadata extraction.

Returns:

  • root_metadata: Metadata of the root document

  • embedded_documents: List of TikaUnpackedItem objects representing the extracted files

Return type:

TikaUnpackResult

Raises:
  • FileNotFoundError – If input file path doesn’t exist

  • ValueError – If input type not supported

  • RuntimeError – If extraction fails

Examples

>>> from pathlib import Path
>>> tika = Tika()
>>> result = tika.unpack("presentation.pptx", Path("extracted/"))
>>> for item in result.embedded_documents:
...     print(f"Found {item.metadata.content_type} at {item.file_path}")
Found image/png at extracted/image1.png
Found application/pdf at extracted/embedded.pdf

Notes

  • Creates output_dir if it doesn’t exist

  • Handles nested containers (ZIP, PDF, Office docs etc)

  • Extracts images, attachments, embedded files

  • Returned paths are relative to output_dir

  • Metadata includes content type, relations, and properties

  • Extraction depth is measured from the input document

  • For streams/bytes, provide filename/type if possible

See also

  • examples/unpack.ipynb: Additional extraction examples

  • RecursiveEmbeddedDocumentExtractor: Core extraction logic

class tikara.TikaDetectLanguageResult(*, language: str, confidence: TikaLanguageConfidence, raw_score: float)[source]#

Bases: BaseModel

Represents the result of a language detection operation.

confidence: TikaLanguageConfidence#
language: str#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

raw_score: float#
exception tikara.TikaError[source]#

Bases: Exception

Base class for all exceptions raised by Tikara.

class tikara.TikaLanguageConfidence(value, names=<not given>, *values, module=None, qualname=None, type=None, start=1, boundary=None)[source]#

Bases: StrEnum

Enum representing the confidence level of a detected language result.

HIGH = 'HIGH'#
LOW = 'LOW'#
MEDIUM = 'MEDIUM'#
NONE = 'NONE'#
class tikara.TikaMetadata(*, encoding: str | None = None, compression: str | None = None, paragraph_count: int | None = None, revision: str | None = None, word_count: int | None = None, line_count: int | None = None, character_count: int | None = None, character_count_with_spaces: int | None = None, page_count: int | None = None, chars_per_page: list[int] | int | None = None, table_count: int | str | None = None, component_count: int | None = None, image_count: int | None = None, hidden_slides: str | None = None, resource_name: str | None = None, resource_path: str | None = None, embedded_resource_type: str | None = None, embedded_relationship_id: str | None = None, embedded_depth: int | None = None, created: str | None = None, modified: str | None = None, content_type: str | None = None, content_type_override: str | None = None, content_length: int | None = None, title: str | None = None, description: str | None = None, type: str | None = None, keywords: str | list[str] | None = None, company: str | None = None, creator: str | None = None, publisher: str | None = None, contributor: str | None = None, language: str | None = None, identifier: str | None = None, application: str | None = None, application_version: str | None = None, producer: str | None = None, version: str | None = None, template: str | None = None, security: str | None = None, is_encrypted: bool | str | None = None, height: int | str | None = None, width: int | str | None = None, duration: float | str | None = None, sample_rate: int | str | None = None, stream_count: int | str | None = None, image_pixel_aspect_ratio: float | str | None = None, image_color_space: str | None = None, audio_channels: int | str | None = None, audio_bits: int | str | None = None, audio_sample_type: str | None = None, audio_encoding: str | None = None, video_frame_rate: float | str | None = None, video_codec: str | None = None, video_frame_count: int | str | None = None, from_: str | None = None, to: str | None = None, cc: str | None = None, bcc: str | None = None, multipart_subtypes: str | None = None, multipart_boundary: str | None = None, raw_metadata: dict[str, ~typing.Any] = <factory>)[source]#

Bases: BaseModel

Normalized metadata from Tika document processing with standardized field names.

application: str | None#
application_version: str | None#
audio_bits: int | str | None#
audio_channels: int | str | None#
audio_encoding: str | None#
audio_sample_type: str | None#
bcc: str | None#
cc: str | None#
character_count: int | None#
character_count_with_spaces: int | None#
chars_per_page: list[int] | int | None#
company: str | None#
component_count: int | None#
compression: str | None#
content_length: int | None#
content_type: str | None#
content_type_override: str | None#
contributor: str | None#
created: str | None#
creator: str | None#
description: str | None#
duration: float | str | None#
embedded_depth: int | None#
embedded_relationship_id: str | None#
embedded_resource_type: str | None#
encoding: str | None#
from_: str | None#
height: int | str | None#
hidden_slides: str | None#
identifier: str | None#
image_color_space: str | None#
image_count: int | None#
image_pixel_aspect_ratio: float | str | None#
is_encrypted: bool | str | None#
keywords: str | list[str] | None#
language: str | None#
line_count: int | None#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

modified: str | None#
multipart_boundary: str | None#
multipart_subtypes: str | None#
page_count: int | None#
paragraph_count: int | None#
producer: str | None#
publisher: str | None#
raw_metadata: dict[str, Any]#
resource_name: str | None#
resource_path: str | None#
revision: str | None#
sample_rate: int | str | None#
security: str | None#
stream_count: int | str | None#
table_count: int | str | None#
template: str | None#
title: str | None#
to: str | None#
type: str | None#
version: str | None#
video_codec: str | None#
video_frame_count: int | str | None#
video_frame_rate: float | str | None#
width: int | str | None#
word_count: int | None#
class tikara.TikaUnpackResult(*, root_metadata: ~tikara.data_types.TikaMetadata, embedded_documents: list[~tikara.data_types.TikaUnpackedItem] = <factory>)[source]#

Bases: BaseModel

Result of unpacking a document with embedded files.

embedded_documents: list[TikaUnpackedItem]#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].

root_metadata: TikaMetadata#
class tikara.TikaUnpackedItem(*, metadata: TikaMetadata, file_path: Path)[source]#

Bases: BaseModel

Individual unpacked embedded document.

file_path: Path#
metadata: TikaMetadata#
model_config: ClassVar[ConfigDict] = {}#

Configuration for the model, should be a dictionary conforming to [ConfigDict][pydantic.config.ConfigDict].