
Replacing text in a PDF file using iText


開心每一天1111 2023-05-10 17:31:37
I am using the iText library (5.5.13) to read a .PDF and replace a pattern in the file. The problem is that the pattern is never found, because some strange characters show up when the library reads the PDF. For example, the sentence:

"This is a test in order to see if the"

turns into this when I try to read it:

[(This is a )9(te)-3(st)9( in o)-4(rd)15(er )-2(t)9(o)-5( s)8(ee)7( if t)-3(h)3(e )]

So if I try to find and replace "test", no "test" word is found in the PDF and nothing is replaced. This is the code I am using:

public void processPDF(String src, String dest) {
    try {
        PdfReader reader = new PdfReader(src);
        PdfArray refs = null;
        PRIndirectReference reference = null;
        int nPages = reader.getNumberOfPages();
        for (int i = 1; i <= nPages; i++) {
            PdfDictionary dict = reader.getPageN(i);
            PdfObject object = dict.getDirectObject(PdfName.CONTENTS);
            if (object.isArray()) {
                refs = dict.getAsArray(PdfName.CONTENTS);
                ArrayList<PdfObject> references = refs.getArrayList();
                for (PdfObject r : references) {
                    reference = (PRIndirectReference) r;
                    PRStream stream = (PRStream) PdfReader.getPdfObject(reference);
                    byte[] data = PdfReader.getStreamBytes(stream);
                    String dd = new String(data, "UTF-8");
                    dd = dd.replaceAll("@pattern_1234", "trueValue");
                    dd = dd.replaceAll("test", "tested");
                    stream.setData(dd.getBytes());
                }
            }
            if (object instanceof PRStream) {
                PRStream stream = (PRStream) object;
                byte[] data = PdfReader.getStreamBytes(stream);
                String dd = new String(data, "UTF-8");
                System.out.println("content---->" + dd);
                dd = dd.replaceAll("@pattern_1234", "trueValue");
                dd = dd.replaceAll("This", "FIRST");
                stream.setData(dd.getBytes(StandardCharsets.UTF_8));
            }
        }
        PdfStamper stamper = new PdfStamper(reader, new FileOutputStream(dest));
        stamper.close();
        reader.close();
    }
    catch (Exception e) {
    }
}

烙印99


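As already mentioned in comments and answers, PDF is not a format meant for text editing. It is a final format, and information on the flow of the text, its layout, and even its mapping to Unicode is optional.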

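Thus, even assuming the optional information on mapping glyphs to Unicode is present, the approach to this task with iText may look a bit unsatisfying: first determine the position of the text in question using a custom text extraction strategy, then remove the current contents of everything at that position using the PdfCleanUpProcessor, and finally draw the replacement text into the gap.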

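In this answer I present a helper class that combines the first two steps, finding and removing the existing text. Its advantage is that only the text is actually removed, not any background graphics etc. as well, as happens with PdfCleanUpProcessor redaction. The helper also returns the positions of the removed text, so the replacement can be stamped there.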

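The helper class is based on the PdfContentStreamEditor presented in an earlier answer. Please use the version of that class on github, though, as the original class has been enhanced somewhat since its conception.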

The SimpleTextRemover helper class illustrates what is necessary to properly remove text from a PDF. It is limited in a few respects:

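  • It only replaces text in the actual page content streams.

    To also replace text in embedded XObjects, one has to iterate recursively through the XObject resources of the page in question and apply the editor to them as well.

  • It is "simple" in the same way as the SimpleTextExtractionStrategy: it assumes that the text showing instructions appear in the content in reading order.

    To also handle content streams whose instructions are in a different order and must be sorted, all incoming instructions and the associated render information must be cached until the end of the page, not merely a few instructions at a time. The render information can then be sorted, the sections to remove can be identified in the sorted render information, the associated instructions can be manipulated, and the instructions can eventually be stored.

  • It does not try to identify gaps between glyphs that visually represent white space although there is no glyph there at all.

    To identify gaps, the code must be extended to check whether two consecutive glyphs follow one another exactly or whether there is a gap or a line jump between them.

  • When calculating the gap to leave in place of a removed glyph, it does not yet take character and word spacing into account.

    To improve this, the glyph width calculation must be improved.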
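As a rough illustration of the last limitation, the following sketch computes a TJ position adjustment that also accounts for character and word spacing. It follows the glyph displacement formula of the PDF specification, tx = ((w0 - Tj/1000) * Tfs + Tc + Tw) * Th, solved for the Tj value that yields the same displacement as the removed glyph. The class and method names are hypothetical, and how the spacing values are obtained from the graphics state in iText is left open here.

```java
/** Hypothetical helper, not part of iText or of SimpleTextRemover. */
final class GlyphGap {

    /**
     * Returns the number to put into a TJ array so that the text position
     * advances exactly as far as the removed glyph would have advanced it.
     *
     * @param glyphWidth        glyph width in thousandths of a text space unit
     * @param fontSize          current font size (Tfs), must not be zero
     * @param charSpacing       current character spacing (Tc)
     * @param wordSpacing       current word spacing (Tw)
     * @param isSingleByteSpace whether the glyph is the single-byte code 32,
     *                          the only case in which word spacing applies
     */
    static float adjustment(float glyphWidth, float fontSize, float charSpacing,
                            float wordSpacing, boolean isSingleByteSpace) {
        float spacing = charSpacing + (isSingleByteSpace ? wordSpacing : 0f);
        // TJ numbers are subtracted from the position, hence the negative sign.
        // Horizontal scaling (Th) applies to both sides and cancels out.
        return -(glyphWidth + 1000f * spacing / fontSize);
    }
}
```

Without any spacing set, the result is simply the negated glyph width, which is what the helper class uses.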

Considering the example excerpt from your content stream, though, these limitations probably will not hinder you.

import java.io.IOException;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedList;
import java.util.List;

import com.itextpdf.text.ExceptionConverter;
import com.itextpdf.text.pdf.PdfArray;
import com.itextpdf.text.pdf.PdfLiteral;
import com.itextpdf.text.pdf.PdfNumber;
import com.itextpdf.text.pdf.PdfObject;
import com.itextpdf.text.pdf.PdfStamper;
import com.itextpdf.text.pdf.PdfString;
import com.itextpdf.text.pdf.parser.ImageRenderInfo;
import com.itextpdf.text.pdf.parser.LineSegment;
import com.itextpdf.text.pdf.parser.PdfContentStreamProcessor;
import com.itextpdf.text.pdf.parser.RenderListener;
import com.itextpdf.text.pdf.parser.TextRenderInfo;

// PdfContentStreamEditor is the helper class from the github repository
// mentioned above; it must be in the same package or be imported.
public class SimpleTextRemover extends PdfContentStreamEditor {
    public SimpleTextRemover() {
        super (new SimpleTextRemoverListener());
        ((SimpleTextRemoverListener)getRenderListener()).simpleTextRemover = this;
    }

    /**
     * <p>Removes the string to remove from the given page of the
     * document in the PDF reader the given PDF stamper works on.</p>
     * <p>The result is a list of glyph lists each of which represents
     * a match and can be queried for position information.</p>
     */
    public List<List<Glyph>> remove(PdfStamper pdfStamper, int pageNum, String toRemove) throws IOException {
        if (toRemove.length()  == 0)
            return Collections.emptyList();

        this.toRemove = toRemove;
        cachedOperations.clear();
        elementNumber = -1;
        // Both counters refer to string elements of the current page,
        // so the processed count must restart together with elementNumber.
        processedElements = 0;
        pendingMatch.clear();
        matches.clear();
        allMatches.clear();
        editPage(pdfStamper, pageNum);
        return allMatches;
    }

    /**
     * Adds the given operation to the cached operations and checks
     * whether some cached operations can meanwhile be processed and
     * written to the result content stream.
     */
    @Override
    protected void write(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        cachedOperations.add(new ArrayList<>(operands));

        while (process(processor)) {
            cachedOperations.remove(0);
        }
    }

    /**
     * Removes any started match and sends all remaining cached
     * operations for processing.
     */
    @Override
    public void finalizeContent() {
        pendingMatch.clear();
        try {
            while (!cachedOperations.isEmpty()) {
                if (!process(this)) {
                    // TODO: Should not happen, so warn
                    System.err.printf("Failure flushing operation %s; dropping.\n", cachedOperations.get(0));
                }
                cachedOperations.remove(0);
            }
        } catch (IOException e) {
            throw new ExceptionConverter(e);
        }
    }

    /**
     * Tries to process the first cached operation. Returns whether
     * it could be processed.
     */
    boolean process(PdfContentStreamProcessor processor) throws IOException {
        if (cachedOperations.isEmpty())
            return false;

        // The operator is always the last entry of the operand list
        List<PdfObject> operands = cachedOperations.get(0);
        PdfLiteral operator = (PdfLiteral) operands.get(operands.size() - 1);
        String operatorString = operator.toString();

        if (TEXT_SHOWING_OPERATORS.contains(operatorString))
            return processTextShowingOp(processor, operator, operands);

        super.write(processor, operator, operands);
        return true;
    }

    /**
     * Tries to process a text showing operation. Unless a match
     * is pending and starts before the end of the argument of this
     * instruction, it can be processed. If the instruction contains
     * a part of a match, it is transformed to a TJ operation and
     * the glyphs in question are replaced by text position adjustments.
     * If the original operation had a side effect (jump to next line
     * or spacing adjustment), this side effect is explicitly added.
     */
    boolean processTextShowingOp(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        PdfObject object = operands.get(operands.size() - 2);
        boolean isArray = object instanceof PdfArray;
        PdfArray array = isArray ? (PdfArray) object : new PdfArray(object);
        int elementCount = countStrings(object);

        // Currently pending glyph intersects parameter of this operation -> cannot yet process
        if (!pendingMatch.isEmpty() && pendingMatch.get(0).elementNumber < processedElements + elementCount)
            return false;

        // The parameter of this operation is not subject to a match -> copy as is
        if (matches.size() == 0 || processedElements + elementCount <= matches.get(0).get(0).elementNumber || elementCount == 0) {
            super.write(processor, operator, operands);
            processedElements += elementCount;
            return true;
        }

        // The parameter of this operation contains glyphs of a match -> manipulate
        PdfArray newArray = new PdfArray();
        for (int arrayIndex = 0; arrayIndex < array.size(); arrayIndex++) {
            PdfObject entry = array.getPdfObject(arrayIndex);
            if (!(entry instanceof PdfString)) {
                newArray.add(entry);
            } else {
                PdfString entryString = (PdfString) entry;
                byte[] entryBytes = entryString.getBytes();
                for (int index = 0; index < entryBytes.length; ) {
                    List<Glyph> match = matches.size() == 0 ? null : matches.get(0);
                    Glyph glyph = match == null ? null : match.get(0);
                    // No (further) match in this string -> copy the remainder
                    if (glyph == null || processedElements < glyph.elementNumber) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, entryBytes.length)));
                        break;
                    }
                    // Copy the bytes in front of the next matched glyph
                    if (index < glyph.index) {
                        newArray.add(new PdfString(Arrays.copyOfRange(entryBytes, index, glyph.index)));
                        index = glyph.index;
                        continue;
                    }
                    // Replace the matched glyph by an equally wide position adjustment
                    newArray.add(new PdfNumber(-glyph.width));
                    index++;
                    match.remove(0);
                    if (match.isEmpty())
                        matches.remove(0);
                }
                processedElements++;
            }
        }
        writeSideEffect(processor, operator, operands);
        writeTJ(processor, newArray);

        return true;
    }

    /**
     * Counts the strings in the given argument, itself a string or
     * an array containing strings and non-strings.
     */
    int countStrings(PdfObject textArgument) {
        if (textArgument instanceof PdfArray) {
            int result = 0;
            for (PdfObject object : (PdfArray)textArgument) {
                if (object instanceof PdfString)
                    result++;
            }
            return result;
        } else
            return textArgument instanceof PdfString ? 1 : 0;
    }

    /**
     * Writes side effects of a text showing operation which is going to be
     * replaced by a TJ operation. Side effects are line jumps and changes
     * of character or word spacing.
     */
    void writeSideEffect(PdfContentStreamProcessor processor, PdfLiteral operator, List<PdfObject> operands) throws IOException {
        switch (operator.toString()) {
        case "\"":
            super.write(processor, OPERATOR_Tw, Arrays.asList(operands.get(0), OPERATOR_Tw));
            super.write(processor, OPERATOR_Tc, Arrays.asList(operands.get(1), OPERATOR_Tc));
            // intentional fall-through: " also moves to the next line, like '
        case "'":
            super.write(processor, OPERATOR_Tasterisk, Collections.singletonList(OPERATOR_Tasterisk));
        }
    }

    /**
     * Writes a TJ operation with the given array unless array is empty.
     */
    void writeTJ(PdfContentStreamProcessor processor, PdfArray array) throws IOException {
        if (!array.isEmpty()) {
            List<PdfObject> operands = Arrays.asList(array, OPERATOR_TJ);
            super.write(processor, OPERATOR_TJ, operands);
        }
    }

    /**
     * Analyzes the given text render info whether it starts a new match or
     * finishes / continues / breaks a pending match. This method is called
     * by the {@link SimpleTextRemoverListener} registered as render listener
     * of the underlying content stream processor.
     */
    void renderText(TextRenderInfo renderInfo) {
        elementNumber++;
        int index = 0;
        for (TextRenderInfo info : renderInfo.getCharacterRenderInfos()) {
            int matchPosition = pendingMatch.size();
            pendingMatch.add(new Glyph(info, elementNumber, index));
            if (!toRemove.substring(matchPosition, matchPosition + info.getText().length()).equals(info.getText())) {
                reduceToPartialMatch();
            }
            if (pendingMatch.size() == toRemove.length()) {
                matches.add(new ArrayList<>(pendingMatch));
                allMatches.add(new ArrayList<>(pendingMatch));
                pendingMatch.clear();
            }
            index++;
        }
    }

    /**
     * Reduces the current pending match to an actual (partial) match
     * after the addition of the next glyph has invalidated it as a
     * whole match.
     */
    void reduceToPartialMatch() {
        outer:
        while (!pendingMatch.isEmpty()) {
            pendingMatch.remove(0);
            int index = 0;
            for (Glyph glyph : pendingMatch) {
                if (!toRemove.substring(index, index + glyph.text.length()).equals(glyph.text)) {
                    continue outer;
                }
                index++;
            }
            break;
        }
    }

    String toRemove = null;
    final List<List<PdfObject>> cachedOperations = new LinkedList<>();

    int elementNumber = -1;
    int processedElements = 0;
    final List<Glyph> pendingMatch = new ArrayList<>();
    final List<List<Glyph>> matches = new ArrayList<>();
    final List<List<Glyph>> allMatches = new ArrayList<>();

    /**
     * Render listener class used by {@link SimpleTextRemover} as listener
     * of its content stream processor ancestor. Essentially it forwards
     * {@link TextRenderInfo} events and ignores all else.
     */
    static class SimpleTextRemoverListener implements RenderListener {
        @Override
        public void beginTextBlock() { }

        @Override
        public void renderText(TextRenderInfo renderInfo) {
            simpleTextRemover.renderText(renderInfo);
        }

        @Override
        public void endTextBlock() { }

        @Override
        public void renderImage(ImageRenderInfo renderInfo) { }

        SimpleTextRemover simpleTextRemover = null;
    }

    /**
     * Value class representing a glyph with information on
     * the displayed text and its position, the overall number
     * of the string argument of a text showing instruction
     * it is in and the index at which it can be found therein,
     * and the width to use as text position adjustment when
     * replacing it. Beware, the width does not yet consider
     * character and word spacing!
     */
    public static class Glyph {
        public Glyph(TextRenderInfo info, int elementNumber, int index) {
            text = info.getText();
            ascent = info.getAscentLine();
            base = info.getBaseline();
            descent = info.getDescentLine();
            this.elementNumber = elementNumber;
            this.index = index;
            this.width = info.getFont().getWidth(text);
        }

        public final String text;
        public final LineSegment ascent;
        public final LineSegment base;
        public final LineSegment descent;
        final int elementNumber;
        final int index;
        final float width;
    }

    final PdfLiteral OPERATOR_Tasterisk = new PdfLiteral("T*");
    final PdfLiteral OPERATOR_Tc = new PdfLiteral("Tc");
    final PdfLiteral OPERATOR_Tw = new PdfLiteral("Tw");
    final PdfLiteral OPERATOR_Tj = new PdfLiteral("Tj");
    final PdfLiteral OPERATOR_TJ = new PdfLiteral("TJ");
    final static List<String> TEXT_SHOWING_OPERATORS = Arrays.asList("Tj", "'", "\"", "TJ");
    final static Glyph[] EMPTY_GLYPH_ARRAY = new Glyph[0];
}

(SimpleTextRemover helper class)
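The matching logic in renderText and reduceToPartialMatch can be hard to follow among the iText details. Here is a minimal sketch of the same idea on plain characters, with no iText involved. Glyphs arrive one at a time, a pending partial match is kept, and when the next glyph breaks it, leading glyphs are dropped until the remainder is a prefix of the search string again. The class name StreamingMatcher and its methods are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical stand-alone sketch of the pending-match bookkeeping. */
final class StreamingMatcher {
    private final String pattern;
    private final StringBuilder pending = new StringBuilder();
    private final List<Integer> matchStarts = new ArrayList<>();
    private int position = 0; // number of glyphs fed so far

    StreamingMatcher(String pattern) {
        if (pattern.isEmpty())
            throw new IllegalArgumentException("pattern must not be empty");
        this.pattern = pattern;
    }

    /** Feeds the next glyph, here simplified to a single character. */
    void feed(char glyph) {
        pending.append(glyph);
        position++;
        // Reduce to a partial match: drop leading glyphs until the
        // pending text is a prefix of the pattern (possibly empty).
        while (pending.length() > 0 && !pattern.startsWith(pending.toString()))
            pending.deleteCharAt(0);
        // A complete match is recorded and the pending match starts over,
        // so matches never overlap.
        if (pending.length() == pattern.length()) {
            matchStarts.add(position - pattern.length());
            pending.setLength(0);
        }
    }

    /** Convenience method: start indices of all matches of pattern in text. */
    static List<Integer> find(String pattern, String text) {
        StreamingMatcher matcher = new StreamingMatcher(pattern);
        for (char c : text.toCharArray())
            matcher.feed(c);
        return matcher.matchStarts;
    }
}
```

In the helper class the pending entries are Glyph objects that also remember which string element and which byte index they came from. That is what later allows the matching bytes to be cut out of the TJ array.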


You can use it like this:


// SOURCE and RESULT_STREAM stand for the input PDF and the output stream;
// Glyph is SimpleTextRemover.Glyph, Vector is com.itextpdf.text.pdf.parser.Vector.
PdfReader pdfReader = new PdfReader(SOURCE);
PdfStamper pdfStamper = new PdfStamper(pdfReader, RESULT_STREAM);
SimpleTextRemover remover = new SimpleTextRemover();

System.out.printf("\ntest.pdf - Test\n");
for (int i = 1; i <= pdfReader.getNumberOfPages(); i++)
{
    System.out.printf("Page %d:\n", i);
    // Remove every occurrence of "Test" from page i and report where it was
    List<List<Glyph>> matches = remover.remove(pdfStamper, i, "Test");
    for (List<Glyph> match : matches) {
        Glyph first = match.get(0);
        Vector baseStart = first.base.getStartPoint();
        Glyph last = match.get(match.size()-1);
        Vector baseEnd = last.base.getEndPoint();
        System.out.printf("  Match from (%3.1f %3.1f) to (%3.1f %3.1f)\n", baseStart.get(Vector.I1), baseStart.get(Vector.I2), baseEnd.get(Vector.I1), baseEnd.get(Vector.I2));
    }
}

pdfStamper.close();

(RemovePageTextContent test testRemoveTestFromTest)


My test file gives the following console output:


test.pdf - Test
Page 1:
  Match from (134,8 666,9) to (177,8 666,9)
  Match from (134,8 642,0) to (153,4 642,0)
  Match from (172,8 642,0) to (191,4 642,0)

and the occurrences of "Test" are missing at those positions in the output PDF.


Instead of printing the match coordinates, you can use them to draw the replacement text at the position in question.
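Before drawing a replacement, it helps to know how much room the removed text left and at what angle the baseline runs. The following sketch derives both from the start and end points of a match, plus a font size at which a replacement of known width would still fit. It is plain arithmetic with no iText involved. The class and method names are hypothetical, and the width of the replacement text at font size 1 is assumed to be supplied by the caller.

```java
/** Hypothetical helper for placing replacement text into the gap of a match. */
final class MatchGeometry {

    /** Length of the gap along the baseline, in user space units. */
    static double gapWidth(double startX, double startY, double endX, double endY) {
        return Math.hypot(endX - startX, endY - startY);
    }

    /** Counter-clockwise rotation of the baseline in degrees. */
    static double rotationDegrees(double startX, double startY, double endX, double endY) {
        return Math.toDegrees(Math.atan2(endY - startY, endX - startX));
    }

    /**
     * Largest font size, capped at preferredSize, at which text that is
     * widthAtSizeOne wide at font size 1 still fits into the gap.
     */
    static double fittingFontSize(double gapWidth, double widthAtSizeOne, double preferredSize) {
        if (widthAtSizeOne <= 0)
            throw new IllegalArgumentException("widthAtSizeOne must be positive");
        return Math.min(preferredSize, gapWidth / widthAtSizeOne);
    }
}
```

For the first match in the console output above, from (134.8, 666.9) to (177.8, 666.9), the gap is about 43 units wide on a horizontal baseline. A longer replacement such as "tested" would therefore need a smaller font size, or it would overlap the text that follows.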


揚帆大魚


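A PDF file is not a word-processing file. What you are seeing is the explicit placement of characters, with characters kerned together and/or any number of other things. Your dream of "replacing" text this way is not possible or, better put, highly unlikely even if not impossible.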
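To make the "explicit placement" concrete: the bracketed excerpt in the question is the operand of a TJ operator, an array in which strings alternate with numeric position adjustments (kerning). The sketch below concatenates only the string operands, which shows why a plain search for "test" in the raw content stream fails. It is a simplification under stated assumptions: the class name is made up, nested parentheses and escape sequences are not decoded, and the string bytes are treated as readable text, which holds for this sample but not for fonts with custom encodings.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper that joins the string operands of a TJ array. */
final class TjArrayText {
    // One literal string operand "( ... )"; a backslash escapes the next character.
    private static final Pattern STRING_OPERAND =
            Pattern.compile("\\(((?:\\\\.|[^\\\\()])*)\\)");

    /** Returns the string operands of the TJ array, joined without the numbers. */
    static String visibleText(String tjArray) {
        StringBuilder text = new StringBuilder();
        Matcher matcher = STRING_OPERAND.matcher(tjArray);
        while (matcher.find())
            text.append(matcher.group(1));
        return text.toString();
    }
}
```

Applied to the excerpt from the question, the raw array contains "(te)-3(st)" but never the character sequence "test", while the joined string operands read "This is a test in order to see if the ".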

A PDF is a binary file with byte offsets. It has many parts, along the lines of: read this at this byte offset, then go to that byte offset and read that.

You cannot just replace "foo" with "foobar" and think it will work. That would break all the byte offsets and completely break the file.
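The effect on byte offsets can be seen on a toy example. The text below is not a valid PDF, only two object-like fragments. It shows that lengthening the first fragment moves the second one, so any offset recorded earlier (as a PDF cross-reference table does for its objects) no longer points at the right place. All names in this sketch are made up for illustration.

```java
import java.nio.charset.StandardCharsets;

/** Hypothetical demo of how an in-place text replacement shifts byte offsets. */
final class OffsetShift {
    // Two object-like fragments; deliberately ASCII only.
    static final String BEFORE = "1 0 obj\n(foo)\nendobj\n2 0 obj\n(bar)\nendobj\n";
    static final String AFTER = BEFORE.replace("foo", "foobar");

    /** Byte offset of the first occurrence of marker, or -1 if absent. */
    static int byteOffsetOf(String content, String marker) {
        int index = content.indexOf(marker);
        if (index < 0)
            return -1;
        return content.substring(0, index).getBytes(StandardCharsets.US_ASCII).length;
    }
}
```

Here `OffsetShift.byteOffsetOf(OffsetShift.BEFORE, "2 0 obj")` is 21, while the same call on `OffsetShift.AFTER` gives 24. A reader that still jumps to offset 21 lands in the middle of the first fragment.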

Try it yourself before asking.

In your example above, open the file in some editor and change the string you posted:

This is a

to this:

WOW Let me change this data around for the content "This is a"

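Save that file and try to open it. Even this, a piece of content that does not cross the boundaries you identified, will not work. Because it is not a word-processing file. It is not a text file. It is a binary file that you cannot manipulate the way you think you can.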

