Class TikaContentExtractor

java.lang.Object
org.apache.cxf.jaxrs.ext.search.tika.TikaContentExtractor

public class TikaContentExtractor extends Object
  • Constructor Details

    • TikaContentExtractor

      public TikaContentExtractor()
      Create new Tika-based content extractor using AutoDetectParser.
    • TikaContentExtractor

      public TikaContentExtractor(org.apache.tika.parser.Parser parser)
      Create new Tika-based content extractor using the provided parser instance.
      Parameters:
      parser - parser instance
    • TikaContentExtractor

      public TikaContentExtractor(List<org.apache.tika.parser.Parser> parsers)
      Create new Tika-based content extractor using the provided parser instances.
      Parameters:
      parsers - parser instances
    • TikaContentExtractor

      public TikaContentExtractor(List<org.apache.tika.parser.Parser> parsers, org.apache.tika.detect.Detector detector)
      Create new Tika-based content extractor using the provided parser instances and detector.
      Parameters:
      parsers - parser instances
      detector - detector instance
    • TikaContentExtractor

      public TikaContentExtractor(org.apache.tika.parser.Parser parser, boolean validateMediaType)
      Create new Tika-based content extractor using the provided parser instance and optional media type validation. If validation is enabled, the implementation will try to detect the media type of the input and validate it against the media types supported by the parser.
      Parameters:
      parser - parser instance
      validateMediaType - enable or disable media type validation
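The effect of the validateMediaType flag can be pictured as a gate in front of the parser. The sketch below is plain Java and only illustrates the documented behavior; MediaTypeGate, its fields, and the use of strings for media types are hypothetical stand-ins, not CXF or Tika classes, and the real implementation may differ.

```java
import java.util.Set;

/** Hypothetical illustration of the validation step; not CXF's actual code. */
final class MediaTypeGate {
    private final Set<String> supportedTypes;
    private final boolean validateMediaType;

    MediaTypeGate(Set<String> supportedTypes, boolean validateMediaType) {
        this.supportedTypes = Set.copyOf(supportedTypes);
        this.validateMediaType = validateMediaType;
    }

    /** Returns true when the input should be handed to the parser. */
    boolean accepts(String detectedType) {
        if (!validateMediaType) {
            // Validation disabled: the detected type is not checked at all.
            return true;
        }
        // Validation enabled: the detected type must be one the parser supports.
        return detectedType != null && supportedTypes.contains(detectedType);
    }
}
```

With validation enabled, an unsupported detected type is rejected before parsing; with validation disabled, the same input is passed through unchecked.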
  • Method Details

    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in)
      Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content and metadata from
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in, jakarta.ws.rs.core.MediaType mt)
      Extract the content and metadata from the input stream with a media type hint.
      Parameters:
      in - input stream to extract the content and metadata from
      mt - JAX-RS MediaType of the stream content
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in, ContentHandler handler)
      Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content and metadata from
      handler - custom ContentHandler
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in, ContentHandler handler, jakarta.ws.rs.core.MediaType mt)
      Extract the content and metadata from the input stream with a media type hint.
      Parameters:
      in - input stream to extract the content and metadata from
      handler - custom ContentHandler
      mt - JAX-RS MediaType of the stream content
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in, ContentHandler handler, org.apache.tika.parser.ParseContext context)
      Extract the content and metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the content and metadata from
      handler - custom ContentHandler
      context - custom context
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extract

      public TikaContentExtractor.TikaContent extract(InputStream in, ContentHandler handler, jakarta.ws.rs.core.MediaType mtHint, org.apache.tika.parser.ParseContext context)
      Extract the content and metadata from the input stream with a media type hint.
      Parameters:
      in - input stream to extract the content and metadata from
      handler - custom ContentHandler
      mtHint - JAX-RS MediaType of the stream content
      context - custom context
      Returns:
      the extracted content and metadata or null if extraction is not possible or was unsuccessful
    • extractMetadata

      public TikaContentExtractor.TikaContent extractMetadata(InputStream in)
      Extract only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the metadata from
      Returns:
      the extracted metadata or null if extraction is not possible or was unsuccessful
    • extractMetadataToSearchBean

      public SearchBean extractMetadataToSearchBean(InputStream in)
      Extract only the metadata from the input stream. If media type validation is enabled, the detector may be run against the input stream to ensure that the parser supports this type of content.
      Parameters:
      in - input stream to extract the metadata from
      Returns:
      the extracted metadata converted to SearchBean or null if extraction is not possible or was unsuccessful
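Every method above returns null when extraction is not possible or was unsuccessful, so callers need a null check before using the result. One possible way to make that explicit is to wrap the call in an Optional. The sketch below is plain Java; NullSafeExtraction and tryExtract are hypothetical helper names, not part of CXF, and the extractor is modeled as a generic function so the example does not depend on the library.

```java
import java.io.InputStream;
import java.util.Optional;
import java.util.function.Function;

/** Hypothetical helper that turns a null-returning extraction into an Optional. */
final class NullSafeExtraction {
    private NullSafeExtraction() {
    }

    /**
     * Applies the given extraction function to the stream and wraps the
     * result, so that a null return becomes an empty Optional.
     */
    static <T> Optional<T> tryExtract(Function<InputStream, T> extractor, InputStream in) {
        return Optional.ofNullable(extractor.apply(in));
    }
}
```

Assuming an existing TikaContentExtractor instance named extractor, a call might look like NullSafeExtraction.tryExtract(s -> extractor.extract(s), in), with the caller deciding what to do when the Optional is empty.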