Contributor with 1829 experience points, 13+ upvotes
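As has already been mentioned in comments and answers, PDF is not a format meant for text editing. It is a final format, and information on the flow of text, its layout, and even its mapping to Unicode is optional.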
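Thus, even assuming the optional information on mapping glyphs to Unicode is present, the approach to this task with iText might look a bit unsatisfying: one would first determine the position of the text in question using a custom text extraction strategy, then remove the current contents of everything at that position using the PdfCleanUpProcessor, and finally draw the replacement text into the gap.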
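In this answer I present a helper class that combines the first two steps, finding and removing the existing text. Its advantage is that only the text is removed, not also any background graphics etc. as is the case with PdfCleanUpProcessor redaction. The helper furthermore returns the positions of the removed text, so a replacement can be stamped there.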
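The helper class is based on the PdfContentStreamEditor presented in this earlier answer. Please use the version of this class on github, though, as the original class has been enhanced a bit since its conception.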
The SimpleTextRemover helper class illustrates what is necessary to properly remove text from a PDF. It is limited in a few respects:
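- It only replaces text in the actual page content streams. To also replace text in embedded XObjects, one has to iterate recursively through the XObject resources of the page in question and apply the editor to them as well.
- It is "simple" in the same way the SimpleTextExtractionStrategy is: it assumes that the text showing instructions appear in the content in reading order. To also handle content streams in which the order differs, the instructions must be sorted. This implies that all incoming instructions and the relevant render information must be cached until the end of the page, not only a few instructions at a time. Then the render information can be sorted, the sections to remove can be identified in the sorted render information, the associated instructions can be manipulated, and the instructions can eventually be stored.
- It does not try to identify gaps between glyphs that visually represent white space while there actually is no glyph at all. To identify gaps, the code must be extended to check whether two consecutive glyphs exactly follow one another or whether there is a gap or a line jump.
- When calculating the gap to leave where a glyph is removed, it does not yet take character and word spacing into account. To improve this, the glyph width calculation must be improved.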
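As a rough illustration of the last two limitations, here is a sketch of how a gap check and a spacing-aware adjustment might be computed. It is not part of the helper class; the class name `GlyphSpacingSketch`, both method names, and the tolerance parameter are made up for this example. The adjustment assumes the text positioning rule of the PDF specification: a glyph advances the text position by (w0 · Tfs + Tc + Tw) · Th, whereas a number in a TJ array moves it by -(number / 1000) · Tfs · Th.

```java
class GlyphSpacingSketch {
    /**
     * TJ adjustment that leaves the same horizontal room as one removed glyph.
     * glyphWidth1000 is the glyph width in 1/1000 text space units (font metrics),
     * fontSize is Tfs, charSpacing is Tc, wordSpacing is Tw.
     * Word spacing only applies to the single-byte character code 32.
     */
    static double adjustment(double glyphWidth1000, double fontSize,
                             double charSpacing, double wordSpacing,
                             boolean isSingleByteSpace) {
        double spacing = charSpacing + (isSingleByteSpace ? wordSpacing : 0);
        // Negative TJ numbers move the following text forward.
        return -(glyphWidth1000 + 1000 * spacing / fontSize);
    }

    /**
     * Hypothetical gap check: true if the next glyph starts noticeably
     * after the end of the previous one on the same baseline.
     */
    static boolean hasGap(double previousEndX, double nextStartX, double tolerance) {
        return nextStartX - previousEndX > tolerance;
    }

    public static void main(String[] args) {
        System.out.println(adjustment(500, 10, 0, 0, false)); // no spacing set
        System.out.println(adjustment(500, 10, 1, 0, false)); // Tc = 1
        System.out.println(hasGap(100.0, 103.0, 0.1));
    }
}
```

With a font size of 10 and a character spacing of 1, a glyph of width 500 advances by 0.5 · 10 + 1 = 6 units, which corresponds to an adjustment of -600 rather than the -500 the helper currently writes.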
Considering the example excerpt from your content stream, though, these restrictions probably won't hinder you.
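// iText 5 imports. PdfContentStreamEditor is the base class mentioned above
// (use the github version) and must be available on the classpath.
import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import com.itextpdf.text.ExceptionConverter;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

public class SimpleTextRemover extends PdfContentStreamEditor {
    public SimpleTextRemover() {
        super(new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener) getRenderListener()).simpleTextRemover = this;
    }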
    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match and can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length() == 0)
            return Collections.emptyList();
        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        // Reset alongside elementNumber so that the same remover can be reused for several pages.
        processedElements = 0;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }
    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));
        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;
        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();
        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);
        super.write(processor, operator, operands);
        return true;
    }
    /**
     * Tries to process a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instruction contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // A pending match intersects the parameter of this operation -> cannot process yet
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is not subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                // Existing position adjustments are kept unchanged
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        // No further match in this string -> keep the rest
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    if (index < glyph.index) {
                        // Keep the bytes in front of the next matched glyph
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    // Replace the matched glyph by a position adjustment of its width
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);
        return true;
    }
    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray) textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else
            return textArgument instanceof PdfString ? 1 : 0;
    }
    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            // The " operator sets word and character spacing ...
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
            // ... and then behaves like ', so the fall-through is intentional
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }
    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }
    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();
    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}
(SimpleTextRemover helper class)
You can use it like this:
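The central step of processTextShowingOp, turning a string argument into a TJ array in which the matched glyphs are replaced by position adjustments, can be modelled without iText. The following is only a simplified sketch: `TjSplitSketch` and `removeRange` are names invented for this illustration, every character is treated as a one-byte glyph, and all glyphs are assumed to have the same width.

```java
import java.util.ArrayList;
import java.util.List;

class TjSplitSketch {
    /**
     * Models a TJ array as a list of strings and numbers. The glyphs
     * at positions from (inclusive) to to (exclusive) are replaced by
     * negative adjustments, so the text after them keeps its position.
     */
    static List<Object> removeRange(String text, int from, int to, int glyphWidth) {
        List<Object> tjArray = new ArrayList<>();
        if (from > 0)
            tjArray.add(text.substring(0, from)); // text in front of the match
        for (int i = from; i < to; i++)
            tjArray.add(-glyphWidth);             // one adjustment per removed glyph
        if (to < text.length())
            tjArray.add(text.substring(to));      // text after the match
        return tjArray;
    }

    public static void main(String[] args) {
        // Remove "Test" (positions 10 to 13) from the string argument
        System.out.println(removeRange("This is a Test!", 10, 14, 500));
    }
}
```

The real class does the same on the raw bytes of each PdfString and uses the individual glyph widths taken from the font.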
// SOURCE and RESULT_STREAM stand for the input PDF and the output stream.
PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++) {
    System.out.printf("Page %d:\n", i);
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        // The baseline of the first and last glyph delimits the removed text.
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size() - 1);
        Vector baseEnd = last.base.getEndPoint();
        // I1 and I2 are the coordinate indices Vector.I1 and Vector.I2 (static imports).
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(I1), baseStart.get(I2), baseEnd.get(I1), baseEnd.get(I2));
    }
}

pdfStamper.close();
(RemovePageTextContent test testRemoveTestFromTest)
For my test file this results in the following console output:
test.pdf - Test
Page 1:
Match from (134,8 666,9) to (177,8 666,9)
Match from (134,8 642,0) to (153,4 642,0)
Match from (172,8 642,0) to (191,4 642,0)
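and the occurrences of "Test" are missing at those positions in the output PDF.
Instead of printing the match coordinates, you can use them to draw the replacement text at the position in question.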
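As a starting point for that, the baseline start and end points of a match are enough to derive where and how to draw the replacement. The sketch below is plain Java and only does the arithmetic; `ReplacementPlacementSketch`, its method names, and the idea of shrinking the font size to fit the gap are my assumptions, not something the helper class provides. The resulting position, rotation, and size could then be passed to a drawing call such as iText's ColumnText.showTextAligned on the canvas returned by pdfStamper.getOverContent for the page; that part is not shown here.

```java
class ReplacementPlacementSketch {
    /** Rotation of the baseline in degrees, counter-clockwise. */
    static double rotationDegrees(double startX, double startY, double endX, double endY) {
        return Math.toDegrees(Math.atan2(endY - startY, endX - startX));
    }

    /** Length of the gap left by the removed text. */
    static double gapLength(double startX, double startY, double endX, double endY) {
        return Math.hypot(endX - startX, endY - startY);
    }

    /**
     * Largest font size, capped at preferredSize, at which the replacement
     * still fits into the gap. widthAtSize1 is the width of the replacement
     * text at font size 1 and has to come from the font metrics.
     */
    static double fittingFontSize(double gapLength, double widthAtSize1, double preferredSize) {
        return Math.min(preferredSize, gapLength / widthAtSize1);
    }

    public static void main(String[] args) {
        // First match of the console output above, read as (134.8, 666.9) to (177.8, 666.9)
        double gap = gapLength(134.8, 666.9, 177.8, 666.9);
        System.out.println("rotation: " + rotationDegrees(134.8, 666.9, 177.8, 666.9));
        System.out.println("gap length: " + gap);
        // 5.0 is a made-up width for illustration
        System.out.println("font size: " + fittingFontSize(gap, 5.0, 12.0));
    }
}
```

The replacement text would be anchored at the baseline start point of the first glyph of the match.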

Contributor with 1799 experience points, 9+ upvotes
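PDF files are not word processing files. What you see is the explicit placement of characters that happen to sit right next to each other, and/or many other things. Your dream of "replacing" text in this manner is not possible, or better said, if not impossible then highly unlikely.
PDFs are binary files with byte offsets. They have many parts. It is like: at this byte offset read this, then go to that byte offset and read that.
You can't just replace "foo" with "foobar" and think it will work. It would shift all the byte offsets and completely break the file.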
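To make the offset problem concrete, here is a toy model in plain Java. It is not a valid PDF; the two "objects" and the class name `ByteOffsetSketch` are invented for illustration. It only shows that an offset recorded before an in-place edit no longer points at the same object afterwards, which is what happens to the offsets a PDF cross-reference table stores.

```java
class ByteOffsetSketch {
    /**
     * Returns by how many bytes the given marker moves when target is
     * replaced by replacement. The data is assumed to be pure ASCII,
     * so character indices equal byte offsets.
     */
    static int offsetShift(String data, String target, String replacement, String marker) {
        int recorded = data.indexOf(marker);               // offset stored before the edit
        String edited = data.replace(target, replacement); // naive in-place text replacement
        int actual = edited.indexOf(marker);               // where the marker really is now
        return actual - recorded;
    }

    public static void main(String[] args) {
        String file = "1 0 obj (foo) endobj\n2 0 obj (bar) endobj\n";
        System.out.println("recorded offset of object 2: " + file.indexOf("2 0 obj"));
        System.out.println("shift after replacing foo by foobar: "
                + offsetShift(file, "foo", "foobar", "2 0 obj"));
    }
}
```

Every offset recorded for anything behind the edited string is off by the three added bytes.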
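Try it yourself before asking.
In your example above, open the file in some editor and change the string you posted:
This is a
to this:
WOW Let me change this data around for the content "This is a"
Save that file and try to open it. Even this, a piece of content that does not span any of the boundaries you identified, will not work. Because it is not a word processing file. It is not a text file. It is a binary file that you can't manipulate the way you think you can.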