Welcome to Tikara’s documentation!#
Tikara#
🚀 Overview#
Tikara is a modern, type-hinted Python wrapper for Apache Tika, supporting over 1600 file formats for content extraction, metadata analysis, and language detection. It talks to the JVM directly through JPype (JNI), so no separate Tika server or HTTP round trip is needed.
from tikara import Tika
tika = Tika()
content, metadata = tika.parse("document.pdf")
⚡️ Key Features#
Modern Python 3.12+ with complete type hints
Direct JVM integration via JPype (no HTTP server required)
Streaming support for large files
Recursive document unpacking
Language detection
MIME type detection
Custom parser and detector support
Comprehensive metadata extraction
Ships with embedded Tika JAR: works in air-gapped networks. No need to manage libraries.
Opinionated Pydantic wrapper over Tika’s metadata model, with access to the raw metadata.
📦 Supported Formats#
🌈 1682 supported media types and counting!
🛠️ Installation#
pip install tikara
System Dependencies#
Required Dependencies#
Python 3.12+
Java Development Kit 11+ (OpenJDK recommended)
Optional Dependencies#
Image and PDF OCR Enhancements (recommended)#
Tesseract OCR (strongly recommended if you process images) (Reference ⇗)
# Ubuntu
apt-get install tesseract-ocr
Additional language packs for Tesseract (optional):
# Ubuntu
apt-get install tesseract-ocr-deu tesseract-ocr-fra tesseract-ocr-ita tesseract-ocr-spa
ImageMagick for advanced image processing (Reference ⇗)
# Ubuntu
apt-get install imagemagick
Multimedia Enhancements (recommended)#
FFmpeg for enhanced multimedia file support (Reference ⇗)
# Ubuntu
apt-get install ffmpeg
Enhanced PDF Support (recommended)#
PDFBox ⇗ for enhanced PDF support (Reference ⇗)
# Ubuntu
apt-get install pdfbox
Metadata Enhancements (recommended)#
ExifTool for metadata extraction from images (Reference ⇗)
# Ubuntu
apt-get install libimage-exiftool-perl
Geospatial Enhancements#
GDAL for geospatial file support (Reference ⇗)
# Ubuntu
apt-get install gdal-bin
Additional Font Support (recommended)#
MSCore Fonts for enhanced Office file handling (Reference ⇗)
# Ubuntu
apt-get install xfonts-utils fonts-freefont-ttf fonts-liberation ttf-mscorefonts-installer
For more OS dependency information including MSCore fonts setup and additional configuration, see the official Apache Tika Dockerfile.
📖 Usage#
Basic Content Extraction#
from tikara import Tika
from pathlib import Path
tika = Tika()
# Basic string output
content, metadata = tika.parse("document.pdf")
# Stream large files
stream, metadata = tika.parse(
    "large.pdf",
    output_stream=True,
    output_format="txt"
)
# Save to file
output_path, metadata = tika.parse(
    "input.docx",
    output_file=Path("output.txt"),
    output_format="txt"
)
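A stream returned with `output_stream=True` can be consumed incrementally instead of being read in one call. The helper below is a minimal sketch and not part of Tikara: `iter_chunks` is a hypothetical name, and it assumes the returned stream is a file-like object exposing `read(size)`. An in-memory buffer stands in for the real stream so the snippet runs without Java.

```python
import io
from collections.abc import Iterator
from typing import IO, AnyStr


def iter_chunks(stream: IO[AnyStr], chunk_size: int = 64 * 1024) -> Iterator[AnyStr]:
    """Yield successive chunks from a file-like object until it is exhausted."""
    while True:
        chunk = stream.read(chunk_size)
        if not chunk:  # empty bytes or empty str signals end of stream
            break
        yield chunk


# Stand-in for the stream returned by tika.parse(..., output_stream=True)
demo_stream = io.BytesIO(b"x" * 150_000)
total = sum(len(chunk) for chunk in iter_chunks(demo_stream))
print(total)
```

Processing chunk by chunk keeps peak memory bounded by `chunk_size` rather than by the size of the extracted text.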
Language Detection#
from tikara import Tika
tika = Tika()
result = tika.detect_language("El rápido zorro marrón salta sobre el perro perezoso")
print(f"Language: {result.language}, Confidence: {result.confidence}")
MIME Type Detection#
from tikara import Tika
tika = Tika()
mime_type = tika.detect_mime_type("unknown_file")
print(f"Detected type: {mime_type}")
Recursive Document Unpacking#
from tikara import Tika
from pathlib import Path
tika = Tika()
results = tika.unpack(
    "container.docx",
    output_dir=Path("extracted"),
    max_depth=3
)
for item in results:
    print(f"Extracted {item.metadata['Content-Type']} to {item.file_path}")
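Once a container has been unpacked, a quick tally of what it held is often useful. The following sketch assumes only what the example above shows: each item has a `metadata` mapping with a `Content-Type` entry and a `file_path`. The function name `count_content_types` and the stand-in objects are illustrative, not part of Tikara.

```python
from collections import Counter
from collections.abc import Iterable
from types import SimpleNamespace
from typing import Any


def count_content_types(items: Iterable[Any]) -> Counter[str]:
    """Tally unpacked items by their Content-Type metadata entry."""
    return Counter(str(item.metadata["Content-Type"]) for item in items)


# Stand-in objects shaped like the items returned by tika.unpack(...)
fake_results = [
    SimpleNamespace(metadata={"Content-Type": "image/png"}, file_path="extracted/a.png"),
    SimpleNamespace(metadata={"Content-Type": "image/png"}, file_path="extracted/b.png"),
    SimpleNamespace(metadata={"Content-Type": "application/xml"}, file_path="extracted/c.xml"),
]

summary = count_content_types(fake_results)
for content_type, count in summary.most_common():
    print(f"{content_type}: {count}")
```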
🔧 Development#
Environment Setup#
Ensure that you have the system dependencies installed
Install uv:
curl -LsSf https://astral.sh/uv/install.sh | sh
Install Python dependencies and create the virtual environment:
make install
Common Tasks#
Run make (or make help) to see all available targets. The most common ones:
# Setup
make install # Install all dependencies (including dev)
make stubs # Regenerate Java type stubs from the Tika JAR
# Lint & Format
make lint # Run ruff linter (with auto-fix)
make format # Run ruff formatter
make ruff # Run linter and formatter together
# Test
make test # Run tests with verbose output
make test-fast # Run tests, skip slow benchmark/isolated markers
make test-coverage # Run tests with coverage report (XML + terminal)
# Docs
make docs # Build Sphinx HTML docs
make docs-open # Build docs and open in browser
# Security
make safety # Run safety dependency vulnerability scan
# Build & Release
make build # Build sdist and wheel
make clean # Remove build artifacts, caches, and generated reports
# CI / Pre-push
make ci # Run full CI suite (lint → test → safety → docs)
make prepush # Alias for ci — run before pushing
🤔 When to Use Tikara#
Ideal Use Cases#
Advanced Usage#
For detailed documentation on:
Custom parser implementation
Custom detector creation
MIME type handling
See the Example Jupyter Notebooks 📔
🎯 Inspiration#
Tikara builds on the shoulders of giants:
Apache Tika - The powerful content detection and extraction toolkit
tika-python - The original Python Tika wrapper using HTTP that inspired this project
JPype - The bridge between Python and Java
Considerations#
Process isolation: Tika runs inside the host Python process, so a Tika or JVM crash will take the host application down with it
Memory management: Large documents require careful handling
JVM startup: Initial overhead for first operation
Custom implementations: Parser/detector development requires Java interface knowledge
📊 Performance Considerations#
Memory Management#
Use streaming for large files
Monitor JVM heap usage
Consider process isolation for critical applications
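One way to get process isolation is to run each parse in a short-lived child interpreter, so that a JVM crash ends the child rather than the host. This is a sketch under stated assumptions, not a Tikara feature: `run_isolated` and `TIKARA_CHILD_SCRIPT` are hypothetical names, the child script assumes Tikara is installed for the same interpreter, and it assumes nothing else writes to the child's stdout. The self-check at the end uses a trivial script so the snippet runs without Java.

```python
import json
import subprocess
import sys

# Child-process script (assumption: tikara is importable by sys.executable).
# It only uses the Tika().parse(path) call shown in the usage section.
TIKARA_CHILD_SCRIPT = """
import json, sys
from tikara import Tika
content, _metadata = Tika().parse(sys.argv[1])
json.dump({"content": content}, sys.stdout)
"""


def run_isolated(script: str, argument: str, timeout: float = 120.0) -> dict:
    """Run `script` in a fresh Python process and decode its JSON stdout.

    Raises subprocess.CalledProcessError if the child exits abnormally and
    subprocess.TimeoutExpired if it does not finish within `timeout` seconds.
    """
    completed = subprocess.run(
        [sys.executable, "-c", script, argument],
        capture_output=True,
        text=True,
        timeout=timeout,
        check=True,
    )
    return json.loads(completed.stdout)


# Self-check with a trivial child script that needs neither tikara nor Java
echo = run_isolated("import json, sys; json.dump({'arg': sys.argv[1]}, sys.stdout)", "demo.pdf")
print(echo)
```

With Tikara installed, `run_isolated(TIKARA_CHILD_SCRIPT, "document.pdf")` would parse in the child. The trade-off is that every call pays the JVM startup cost again, so this pattern suits untrusted or crash-prone inputs more than bulk processing.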
Optimization Tips#
Reuse Tika instances
Use appropriate output formats
Implement custom parsers for specific needs
Configure JVM parameters for your use case
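Reusing one instance across a batch might look like the sketch below. It is written against any object with a `parse(path)` method returning `(content, metadata)`, which is the shape shown in the usage section; `parse_many` and `FakeParser` are hypothetical names used so the example runs without Java. In real use a single `Tika()` instance would be passed in place of the fake.

```python
from collections.abc import Iterable
from typing import Any


def parse_many(parser: Any, paths: Iterable[str]) -> tuple[dict[str, Any], dict[str, Exception]]:
    """Parse every path with one shared parser, collecting failures per file."""
    parsed: dict[str, Any] = {}
    failed: dict[str, Exception] = {}
    for path in paths:
        try:
            content, _metadata = parser.parse(path)
        except Exception as exc:  # keep going; one bad file should not stop the batch
            failed[path] = exc
        else:
            parsed[path] = content
    return parsed, failed


class FakeParser:
    """Stand-in exposing the same parse(path) -> (content, metadata) shape."""

    def parse(self, path: str) -> tuple[str, dict]:
        if path.endswith(".bad"):
            raise ValueError(f"cannot parse {path}")
        return f"text of {path}", {}


parsed, failed = parse_many(FakeParser(), ["a.pdf", "b.bad", "c.docx"])
print(sorted(parsed), sorted(failed))
```

Creating the parser once outside the loop means the JVM startup overhead mentioned above is paid a single time for the whole batch.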
🔐 Security Considerations#
Input validation
Resource limits
Secure file handling
Access control for extracted content
Careful handling of custom parsers
🤝 Contributing#
Contributions welcome! The project uses Make for development tasks:
make prepush # Run full CI suite (lint, test, coverage, safety, docs)
For developing custom parsers/detectors, Java stubs can be generated:
make stubs # Generate Java stubs for Apache Tika interfaces
Note: Generated stubs are git-ignored but provide IDE support and type hints when implementing custom parsers/detectors.
Common Problems#
Verify Java installation and JAVA_HOME environment variable
Ensure Tesseract and required language packs are installed
Check file permissions and paths
Monitor memory usage when processing large files
Use streaming output for large documents
📚 Reference#
See API Documentation for complete details.
📄 License#
Apache License 2.0 - See LICENSE for details.