public abstract class A_CmsTextExtractor extends java.lang.Object implements I_CmsTextExtractor
Constructor and Description |
---|
A_CmsTextExtractor() |
Modifier and Type | Method and Description |
---|---|
protected void | combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,java.lang.String> contentItems): Combines a meta information item extracted from the document with the main content buffer and also stores the individual information as an item in the Map of content items. |
I_CmsExtractionResult | extractText(byte[] content): Extracts the text and meta information from the given binary document. |
I_CmsExtractionResult | extractText(byte[] content, java.lang.String encoding): Extracts the text and meta information from the given binary document, using the specified content encoding. |
I_CmsExtractionResult | extractText(java.io.InputStream in): Extracts the text and meta information from the document on the input stream. |
protected CmsExtractionResult | extractText(java.io.InputStream in, org.apache.tika.parser.Parser parser): Parses the given input stream with the provided parser and returns the result as a map of content items. |
I_CmsExtractionResult | extractText(java.io.InputStream in, java.lang.String encoding): Extracts the text and meta information from the document on the input stream, using the specified content encoding. |
protected java.lang.String | removeControlChars(java.lang.String content): Removes "unwanted" control chars from the given content. |
public I_CmsExtractionResult extractText(byte[] content) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
The encoding of the document is either not required (the document type may have one common default encoding) or the extractor is able to detect the encoding from the provided byte array automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(byte[], String) with the String argument set to null.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[])
public I_CmsExtractionResult extractText(byte[] content, java.lang.String encoding) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
content - the binary content of the document to extract the text from
encoding - the encoding to use
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(byte[], java.lang.String)
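The "encoding as a hint" contract of the two byte-array overloads can be illustrated with a minimal sketch. Everything in it is hypothetical: the class `EncodingHintSketch`, the helper `resolveCharset`, and the UTF-8 fallback are assumptions made for illustration, and the methods return a plain `String` instead of an `I_CmsExtractionResult`. A real extractor may detect the encoding from the document itself rather than fall back to a fixed charset.

```java
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration; it does not reproduce the documented classes.
final class EncodingHintSketch {

    // Assumed fallback for this sketch; a real extractor may inspect the content instead.
    static final Charset FALLBACK = StandardCharsets.UTF_8;

    /** Treats the encoding as a hint: null or unknown names resolve to FALLBACK. */
    static Charset resolveCharset(String encoding) {
        if (encoding == null) {
            return FALLBACK;
        }
        try {
            return Charset.forName(encoding);
        } catch (IllegalArgumentException e) {
            // Both illegal and unsupported charset names end up here.
            return FALLBACK;
        }
    }

    /** Two-argument variant: decodes the bytes using the hinted encoding. */
    static String extractText(byte[] content, String encoding) {
        return new String(content, resolveCharset(encoding));
    }

    /** One-argument variant: same result as passing null for the encoding. */
    static String extractText(byte[] content) {
        return extractText(content, null);
    }

    public static void main(String[] args) {
        byte[] latin1 = "Gr\u00f6\u00dfe".getBytes(StandardCharsets.ISO_8859_1);
        System.out.println(extractText(latin1, "ISO-8859-1"));

        byte[] utf8 = "Gr\u00f6\u00dfe".getBytes(StandardCharsets.UTF_8);
        System.out.println(extractText(utf8));
    }
}
```

Having the one-argument overload delegate to the two-argument one keeps the "same result when the encoding is null" guarantee in a single place.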
public I_CmsExtractionResult extractText(java.io.InputStream in) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
The encoding of the input stream is either not required (the document type may have one common default encoding) or the extractor is able to detect the encoding from the provided input stream automatically.
Delivers the same result as calling I_CmsTextExtractor.extractText(InputStream, String) with the String argument set to null.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream)
public I_CmsExtractionResult extractText(java.io.InputStream in, java.lang.String encoding) throws java.lang.Exception
Description copied from interface: I_CmsTextExtractor
The encoding is a hint for the text extractor. If the value given is null, the text extractor should try to figure out the encoding itself.
Specified by:
extractText in interface I_CmsTextExtractor
Parameters:
in - the input stream for the document to extract the text from
encoding - the encoding to use
Throws:
java.lang.Exception - if the text extraction fails
See Also:
I_CmsTextExtractor.extractText(java.io.InputStream, java.lang.String)
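From the caller's side, the stream-based overloads might be used as in the following sketch. The class `StreamExtractionSketch` is a hypothetical stand-in, not the documented class: it assumes that the stream is buffered completely before decoding, that a null encoding means UTF-8, and that the result is a plain `String`. None of these details are stated in the documentation above.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for illustration; the buffering strategy is an assumption.
final class StreamExtractionSketch {

    /** Reads the whole stream, then decodes it with the hinted encoding (UTF-8 if null). */
    static String extractText(InputStream in, String encoding) {
        try {
            byte[] bytes = in.readAllBytes(); // requires Java 9 or later
            Charset charset = (encoding == null)
                ? StandardCharsets.UTF_8
                : Charset.forName(encoding);
            return new String(bytes, charset);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Same result as calling the two-argument variant with a null encoding. */
    static String extractText(InputStream in) {
        return extractText(in, null);
    }

    public static void main(String[] args) throws IOException {
        byte[] document = "plain text".getBytes(StandardCharsets.UTF_16);
        // The caller owns the stream, so it is closed here with try-with-resources.
        try (InputStream in = new ByteArrayInputStream(document)) {
            System.out.println(extractText(in, "UTF-16"));
        }
    }
}
```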
protected void combineContentItem(java.lang.String itemValue, java.lang.String itemKey, java.lang.StringBuffer content, java.util.Map<java.lang.String,java.lang.String> contentItems)
Parameters:
itemValue - the value of the item to store
itemKey - the key in the Map of content items
content - the buffer to which the content item is appended
contentItems - the Map of individual content items
protected CmsExtractionResult extractText(java.io.InputStream in, org.apache.tika.parser.Parser parser) throws java.lang.Exception
Parameters:
in - the input stream for the content to parse
parser - the parser to use
Throws:
java.lang.Exception - in case something goes wrong
protected java.lang.String removeControlChars(java.lang.String content)
Parameters:
content - the content to remove the unwanted control chars from
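The two protected helpers, combineContentItem and removeControlChars, are described only briefly, so the following is one possible reading rather than the actual implementation. The class `ExtractorHelpersSketch` is hypothetical. The sketch assumes that blank values are skipped, that items are separated by a newline in the buffer, and that "unwanted" means every ISO control character except tab, line feed, and carriage return, each replaced by a space.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical sketch; the separator and the set of "unwanted" characters are assumptions.
final class ExtractorHelpersSketch {

    /** Appends a meta information item to the main buffer and records it under its key. */
    static void combineContentItem(
        String itemValue,
        String itemKey,
        StringBuffer content,
        Map<String, String> contentItems) {

        if (itemValue == null || itemValue.trim().isEmpty()) {
            return; // assumed: blank items are neither appended nor stored
        }
        String value = itemValue.trim();
        if (content.length() > 0) {
            content.append('\n'); // assumed separator between items
        }
        content.append(value);
        contentItems.put(itemKey, value);
    }

    /** Replaces control chars other than tab, LF, and CR with a single space. */
    static String removeControlChars(String content) {
        if (content == null) {
            return null;
        }
        StringBuilder result = new StringBuilder(content.length());
        for (int i = 0; i < content.length(); i++) {
            char c = content.charAt(i);
            boolean keep = c == '\t' || c == '\n' || c == '\r' || !Character.isISOControl(c);
            result.append(keep ? c : ' ');
        }
        return result.toString();
    }

    public static void main(String[] args) {
        StringBuffer content = new StringBuffer("body text");
        Map<String, String> items = new LinkedHashMap<>();
        combineContentItem("Annual Report", "title", content, items);
        System.out.println(removeControlChars(content.toString()));
        System.out.println(items);
    }
}
```

Storing each item both in the shared buffer and in the map lets a caller search the combined text while still retrieving individual fields by key.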