Package org.apache.tika.parser.pdf
Class PDFMarkedContent2XHTML
java.lang.Object
org.apache.pdfbox.contentstream.PDFStreamEngine
org.apache.pdfbox.text.PDFTextStripper
org.apache.tika.parser.pdf.PDFMarkedContent2XHTML
@Deprecated(since="2026-04-30")
public class PDFMarkedContent2XHTML
extends org.apache.pdfbox.text.PDFTextStripper
Deprecated.
This was added in Tika 1.24 as an alpha version of a text extractor that builds the text from the marked text tree and includes/normalizes some of the structural tags.
Since: 1.24
Field Summary
Fields: XMP_DOCUMENT_CATALOG_LOCATION, XMP_PAGE_LOCATION_PREFIX (both deprecated)
Method Summary
Modifier and Type | Method | Description
int | getCurrentPageNo() | Deprecated. We need to override this because we are overriding processPages(PDPageTree).
int | getStartPage() | Deprecated.
static void | process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) | Deprecated. Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
void | processPage(org.apache.pdfbox.pdmodel.PDPage page) | Deprecated.
void | setEndBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | Deprecated.
void | setStartBookmark(org.apache.pdfbox.pdmodel.interactive.documentnavigation.outline.PDOutlineItem pdOutlineItem) | Deprecated.
void | setStartPage(int startPage) | Deprecated.
Methods inherited from class org.apache.pdfbox.text.PDFTextStripper
getAddMoreFormatting, getArticleEnd, getArticleStart, getAverageCharTolerance, getDropThreshold, getEndBookmark, getEndPage, getIndentThreshold, getLineSeparator, getPageEnd, getPageStart, getParagraphEnd, getParagraphStart, getSeparateByBeads, getSortByPosition, getSpacingTolerance, getStartBookmark, getSuppressDuplicateOverlappingText, getText, getWordSeparator, setAddMoreFormatting, setArticleEnd, setArticleStart, setAverageCharTolerance, setDropThreshold, setEndPage, setIndentThreshold, setLineSeparator, setPageEnd, setPageStart, setParagraphEnd, setParagraphStart, setShouldSeparateByBeads, setSortByPosition, setSpacingTolerance, setSuppressDuplicateOverlappingText, setWordSeparator, writeText
Methods inherited from class org.apache.pdfbox.contentstream.PDFStreamEngine
addOperator, beginMarkedContentSequence, beginText, decreaseLevel, endMarkedContentSequence, endText, getAppearance, getCurrentPage, getGraphicsStackSize, getGraphicsState, getInitialMatrix, getLevel, getResources, getTextLineMatrix, getTextMatrix, increaseLevel, processOperator, registerOperatorProcessor, restoreGraphicsState, saveGraphicsState, setLineDashPattern, setTextLineMatrix, setTextMatrix, showAnnotation, showForm, showTextString, showTextStrings, showTransparencyGroup, transformedPoint
Field Details
XMP_DOCUMENT_CATALOG_LOCATION
Deprecated.
XMP_PAGE_LOCATION_PREFIX
Deprecated.
Method Details
process
public static void process(org.apache.pdfbox.pdmodel.PDDocument pdDocument, ContentHandler handler, ParseContext context, Metadata metadata, PDFParserConfig config) throws SAXException, TikaException
Deprecated.
Converts the given PDF document (and related metadata) to a stream of XHTML SAX events sent to the given content handler.
Parameters:
pdDocument - PDF document
handler - SAX content handler
metadata - PDF metadata
Throws:
SAXException - if the content handler fails to process SAX events
TikaException - if there was an exception outside of per page processing
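Since process delivers its output as SAX events rather than as a return value, the caller supplies the ContentHandler that decides what to keep. The following is a minimal sketch of such a handler using only the JDK's org.xml.sax classes. The class name ParagraphCollector and its demo method are hypothetical, and the assumption that paragraphs arrive as XHTML p elements is illustrative rather than something this page states. In real use the events would come from a call such as PDFMarkedContent2XHTML.process(pdDocument, handler, context, metadata, config) instead of the hand-written events in demo.

```java
import org.xml.sax.SAXException;
import org.xml.sax.helpers.AttributesImpl;
import org.xml.sax.helpers.DefaultHandler;

/** Hypothetical handler: gathers character data and ends a line at each closing p element. */
class ParagraphCollector extends DefaultHandler {
    private final StringBuilder text = new StringBuilder();

    @Override
    public void characters(char[] ch, int start, int length) {
        // Character data may arrive in several chunks; append each one.
        text.append(ch, start, length);
    }

    @Override
    public void endElement(String uri, String localName, String qName) {
        // Assumption: paragraphs are reported as XHTML "p" elements.
        if ("p".equals(localName)) {
            text.append('\n');
        }
    }

    String text() {
        return text.toString();
    }

    /** Feeds a few hand-written events to the handler, standing in for a real parse. */
    static String demo() {
        ParagraphCollector handler = new ParagraphCollector();
        String xhtml = "http://www.w3.org/1999/xhtml";
        char[] chars = "Hello PDF".toCharArray();
        try {
            handler.startDocument();
            handler.startElement(xhtml, "p", "p", new AttributesImpl());
            handler.characters(chars, 0, chars.length);
            handler.endElement(xhtml, "p", "p");
            handler.endDocument();
        } catch (SAXException e) {
            throw new IllegalStateException(e);
        }
        return handler.text();
    }
}
```

Collecting text in the handler keeps the sketch independent of Tika and PDFBox, so the same class could be passed unchanged as the handler argument of process.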
processPage
public void processPage(org.apache.pdfbox.pdmodel.PDPage page) throws IOException
Deprecated.
Overrides: processPage in class org.apache.pdfbox.text.PDFTextStripper
Throws: IOException
getCurrentPageNo
public int getCurrentPageNo()
Deprecated.
We need to override this because we are overriding processPages(PDPageTree).
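The reasoning behind this override can be illustrated with a small, self-contained sketch. It assumes the base class advances a private page counter inside its own page loop, so a subclass that replaces that loop must also keep and report its own counter. BaseStripper, CustomLoopStripper, and the use of strings as pages are hypothetical stand-ins, not the PDFBox or Tika classes.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-in for a base class that owns the page loop and a private counter. */
class BaseStripper {
    private int currentPageNo = 0;

    protected void processPages(List<String> pages) {
        for (String page : pages) {
            currentPageNo++; // only this loop ever advances the base counter
            processPage(page);
        }
    }

    protected void processPage(String page) {
        // no-op in the sketch
    }

    public int getCurrentPageNo() {
        return currentPageNo;
    }
}

/** Hypothetical subclass that replaces the page loop and therefore tracks pages itself. */
class CustomLoopStripper extends BaseStripper {
    private int pageNo = 0;
    private final List<Integer> seen = new ArrayList<>();

    @Override
    protected void processPages(List<String> pages) {
        for (String page : pages) {
            pageNo++; // the base counter is never updated by this loop
            processPage(page);
        }
    }

    @Override
    protected void processPage(String page) {
        // Record the page number visible while this page is processed.
        seen.add(getCurrentPageNo());
    }

    @Override
    public int getCurrentPageNo() {
        // Without this override the inherited getter would keep returning 0.
        return pageNo;
    }

    /** Runs three dummy pages through the custom loop and returns the recorded numbers. */
    static List<Integer> demo() {
        CustomLoopStripper stripper = new CustomLoopStripper();
        stripper.processPages(List.of("a", "b", "c"));
        return stripper.seen;
    }
}
```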
setStartPage
public void setStartPage(int startPage)
Deprecated.
Overrides: setStartPage in class org.apache.pdfbox.text.PDFTextStripper
getStartPage
public int getStartPage()
Deprecated.
Overrides: getStartPage in class org.apache.pdfbox.text.PDFTextStripper