Class TikaLuceneContentExtractor

java.lang.Object
org.apache.cxf.jaxrs.ext.search.tika.TikaLuceneContentExtractor

public class TikaLuceneContentExtractor extends Object
  • Constructor Details

    • TikaLuceneContentExtractor

      public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser)
      Creates a new Tika-based content extractor using the provided parser instance.
      Parameters:
      parser - parser instance
    • TikaLuceneContentExtractor

      public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType)
      Creates a new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
      Parameters:
      parser - parser instance
      validateMediaType - enable or disable media type validation
    • TikaLuceneContentExtractor

      public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, LuceneDocumentMetadata documentMetadata)
      Creates a new Tika-based content extractor using the provided parser instance and document metadata. If media type validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
      Parameters:
      parser - parser instance
      documentMetadata - Lucene document metadata
    • TikaLuceneContentExtractor

      public TikaLuceneContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType, LuceneDocumentMetadata documentMetadata)
      Creates a new Tika-based content extractor using the provided parser instance, optional media type validation, and document metadata. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
      Parameters:
      parser - parser instance
      validateMediaType - enable or disable media type validation
      documentMetadata - Lucene document metadata
    • TikaLuceneContentExtractor

      public TikaLuceneContentExtractor(List<org.apache.tika.parser.Parser> parsers, LuceneDocumentMetadata documentMetadata)
      Creates a new Tika-based content extractor using the provided parser instances and document metadata. If media type validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parsers.
      Parameters:
      parsers - parser instances
      documentMetadata - Lucene document metadata
  • Method Details

    • extract

      public org.apache.lucene.document.Document extract(InputStream in)
      Extracts the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content and metadata from
      Returns:
      the extracted document or null if extraction is not possible or was unsuccessful
    • extract

      public org.apache.lucene.document.Document extract(InputStream in, LuceneDocumentMetadata documentMetadata)
      Extracts the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content and metadata from
      documentMetadata - Lucene document metadata
      Returns:
      the extracted document or null if extraction is not possible or was unsuccessful
    • extractContent

      public org.apache.lucene.document.Document extractContent(InputStream in)
      Extracts only the content from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content from
      Returns:
      the extracted document or null if extraction is not possible or was unsuccessful
    • extractMetadata

      public org.apache.lucene.document.Document extractMetadata(InputStream in)
      Extracts only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the metadata from
      Returns:
      the extracted document or null if extraction is not possible or was unsuccessful
    • extractMetadata

      public org.apache.lucene.document.Document extractMetadata(InputStream in, LuceneDocumentMetadata documentMetadata)
      Extracts only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the metadata from
      documentMetadata - Lucene document metadata
      Returns:
      the extracted document or null if extraction is not possible or was unsuccessful
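
Usage note: every extract method above returns null when extraction is not possible or was unsuccessful, so callers need a null check before using the result. The following is a minimal sketch of one way to handle that, not part of the CXF API. The class NullSafeExtraction and its method tryExtract are hypothetical names introduced for illustration, and the extractor is modelled as a plain java.util.function.Function so the example compiles with the JDK alone.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Optional;
import java.util.function.Function;

// Hypothetical helper (not part of CXF): wraps a null-returning extraction call in an Optional.
final class NullSafeExtraction {

    private NullSafeExtraction() {
    }

    // Applies the extractor to a stream over the given bytes.
    // A null result (extraction not possible or unsuccessful) becomes Optional.empty().
    static <D> Optional<D> tryExtract(Function<InputStream, D> extractor, byte[] data) {
        try (InputStream in = new ByteArrayInputStream(data)) {
            return Optional.ofNullable(extractor.apply(in));
        } catch (IOException e) {
            // Only closing the stream can throw here; rethrow unchecked to keep the signature simple.
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] data = "sample".getBytes(StandardCharsets.UTF_8);

        // Stand-in extractors: one that succeeds and one that returns null.
        Optional<String> found = tryExtract(in -> "document", data);
        Optional<String> missing = tryExtract(in -> null, data);

        System.out.println(found.isPresent());
        System.out.println(missing.isPresent());
    }
}
```

With the real class on the classpath, a method reference such as extractor::extract, where extractor is a TikaLuceneContentExtractor, should fit the Function<InputStream, Document> shape assumed here, since extract(InputStream) takes a stream and returns a Lucene Document or null.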