Class TikaLuceneContentExtractor
java.lang.Object
org.apache.cxf.jaxrs.ext.search.tika.TikaLuceneContentExtractor
-
Constructor Summary
Constructors
Constructor / Description
TikaLuceneContentExtractor(List<org.apache.tika.parser.Parser> parsers, LuceneDocumentMetadata documentMetadata)
    Create new Tika-based content extractor using the provided parser instance and optional media type validation.
TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser)
    Create new Tika-based content extractor using the provided parser instance.
TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType)
    Create new Tika-based content extractor using the provided parser instance and optional media type validation.
TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType, LuceneDocumentMetadata documentMetadata)
    Create new Tika-based content extractor using the provided parser instance and optional media type validation.
TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, LuceneDocumentMetadata documentMetadata)
    Create new Tika-based content extractor using the provided parser instance and optional media type validation.
Method Summary
Modifier and Type / Method / Description
org.apache.lucene.document.Document extract(InputStream in)
    Extract the content and metadata from the input stream.
org.apache.lucene.document.Document extract(InputStream in, LuceneDocumentMetadata documentMetadata)
    Extract the content and metadata from the input stream.
org.apache.lucene.document.Document extractContent(InputStream in)
    Extract the content only from the input stream.
org.apache.lucene.document.Document extractMetadata(InputStream in)
    Extract the metadata only from the input stream.
org.apache.lucene.document.Document extractMetadata(InputStream in, LuceneDocumentMetadata documentMetadata)
    Extract the metadata only from the input stream.
-
Constructor Details
-
TikaLuceneContentExtractor
public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser)
Create new Tika-based content extractor using the provided parser instance.
Parameters:
parser - parser instance
-
TikaLuceneContentExtractor
public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType)
Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
Parameters:
parser - parser instance
validateMediaType - enable or disable media type validation
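The validation described above can be pictured as a detect-then-check sequence. The sketch below is a stand-in written against the JDK only; it is not Tika's detector and not this class's implementation. The `SUPPORTED` set, the `detect` helper, and the `%PDF-` magic-byte check are all assumptions made for illustration.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Set;

final class MediaTypeCheckSketch {
    // Hypothetical stand-in for the media types a parser reports as supported.
    static final Set<String> SUPPORTED = Set.of("application/pdf");

    // Toy detector (an assumption, not Tika): peeks at the leading bytes, then
    // resets the stream so a parser could still read it from the start.
    static String detect(InputStream in) {
        if (!in.markSupported()) {
            throw new IllegalArgumentException("stream must support mark/reset");
        }
        try {
            in.mark(5);
            byte[] head = in.readNBytes(5);
            in.reset();
            String magic = new String(head, StandardCharsets.US_ASCII);
            return "%PDF-".equals(magic) ? "application/pdf" : "application/octet-stream";
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Validation: accept the input only if its detected type is supported.
    static boolean isSupported(InputStream in) {
        return SUPPORTED.contains(detect(in));
    }

    // Helper for the demo: wraps ASCII text in a mark-capable stream.
    static InputStream streamOf(String s) {
        return new ByteArrayInputStream(s.getBytes(StandardCharsets.US_ASCII));
    }

    public static void main(String[] args) {
        System.out.println(isSupported(streamOf("%PDF-1.7 sample")));   // true
        System.out.println(isSupported(streamOf("plain text sample"))); // false
    }
}
```

Detection only peeks at the stream and then resets it, which is why the sketch insists on mark/reset support: the parser still needs to see the input from its first byte.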
-
TikaLuceneContentExtractor
public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, LuceneDocumentMetadata documentMetadata)
Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
Parameters:
parser - parser instance
documentMetadata - document metadata
-
TikaLuceneContentExtractor
public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType, LuceneDocumentMetadata documentMetadata)
Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
Parameters:
parser - parser instance
validateMediaType - enable or disable media type validation
documentMetadata - document metadata
-
TikaLuceneContentExtractor
public TikaLuceneContentExtractor(List<org.apache.tika.parser.Parser> parsers, LuceneDocumentMetadata documentMetadata)
Create new Tika-based content extractor using the provided parser instances and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parsers.
Parameters:
parsers - parser instances
documentMetadata - document metadata
-
-
Method Details
-
extract
public org.apache.lucene.document.Document extract(InputStream in)
Extract the content and metadata from the input stream. Depending on media type validation, the detector may be run against the input stream to ensure that the parser supports this type of content.
Parameters:
in - input stream to extract the content and metadata from
Returns:
the extracted document, or null if extraction is not possible or was unsuccessful
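Because the extraction methods return null rather than throwing when extraction fails, callers need an explicit null check. The sketch below shows one way to fold that contract into an `Optional`. It uses only the JDK; `tryExtract` and `fakeExtract` are hypothetical names, and `fakeExtract` is a stand-in that mimics the null-on-failure behaviour rather than calling this class.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

final class NullSafeExtraction {
    // Applies an extraction function to the bytes and turns a null result
    // ("extraction not possible or unsuccessful") into Optional.empty().
    static <T> Optional<T> tryExtract(Function<InputStream, T> extract, byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return Optional.ofNullable(extract.apply(in));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    // Hypothetical stand-in extractor: returns null for empty input,
    // mimicking the documented null-on-failure contract.
    static String fakeExtract(InputStream in) {
        try {
            byte[] bytes = in.readAllBytes();
            return bytes.length == 0 ? null : new String(bytes, StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] ok = "hello".getBytes(StandardCharsets.UTF_8);
        System.out.println(tryExtract(NullSafeExtraction::fakeExtract, ok).isPresent());          // true
        System.out.println(tryExtract(NullSafeExtraction::fakeExtract, new byte[0]).isPresent()); // false
    }
}
```

With the real class, the function argument would presumably be a method reference such as `extractor::extract`, with `T` bound to `org.apache.lucene.document.Document`. That wiring assumes CXF, Tika, and Lucene are on the classpath and that the method declares no checked exceptions; it is not shown here.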
-
extract
public org.apache.lucene.document.Document extract(InputStream in, LuceneDocumentMetadata documentMetadata)
Extract the content and metadata from the input stream. Depending on media type validation, the detector may be run against the input stream to ensure that the parser supports this type of content.
Parameters:
in - input stream to extract the content and metadata from
documentMetadata - document metadata
Returns:
the extracted document, or null if extraction is not possible or was unsuccessful
-
extractContent
public org.apache.lucene.document.Document extractContent(InputStream in)
Extract the content only from the input stream. Depending on media type validation, the detector may be run against the input stream to ensure that the parser supports this type of content.
Parameters:
in - input stream to extract the content from
Returns:
the extracted document, or null if extraction is not possible or was unsuccessful
-
extractMetadata
public org.apache.lucene.document.Document extractMetadata(InputStream in)
Extract the metadata only from the input stream. Depending on media type validation, the detector may be run against the input stream to ensure that the parser supports this type of content.
Parameters:
in - input stream to extract the metadata from
Returns:
the extracted document, or null if extraction is not possible or was unsuccessful
-
extractMetadata
public org.apache.lucene.document.Document extractMetadata(InputStream in, LuceneDocumentMetadata documentMetadata)
Extract the metadata only from the input stream. Depending on media type validation, the detector may be run against the input stream to ensure that the parser supports this type of content.
Parameters:
in - input stream to extract the metadata from
documentMetadata - document metadata
Returns:
the extracted document, or null if extraction is not possible or was unsuccessful
-