Extracting Tables from a PDF in C#

Tags:
C# .NET

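This article shows how to extract tables from a PDF in a C# program (VB.NET code is also included). It calls the table-extraction classes and methods provided by Spire.PDF for .NET to read the text of each table cell. The main classes and methods used in the code are summarized in the table below for reference: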


Type                                                    Description

PdfDocument class                                       Represents a PDF document model.
PdfDocument.LoadFromFile(string filename) method        Loads a PDF document.
PdfTableExtractor class                                 Represents the PDF table extractor.
PdfTable class                                          Defines a PDF table.
PdfTableExtractor.ExtractTable(int pageIndex) method    Extracts the tables from a page.
PdfTable.GetText(int rowIndex, int columnIndex) method  Gets the text in a cell.
File.WriteAllText() method                              Saves the extracted table text to a .txt file.

Environment

  •   Visual Studio 2017

  •   .NET Framework 4.6.1

  •   A PDF test document

  •   Library: Spire.PDF for .NET 7.10.4

There are two ways to reference the DLL file:

Method 1: Install through NuGet.

Steps:

Right-click "References" and select "Manage NuGet Packages":

https://img3.sycdn.imooc.com/616f791d000192f303730447.jpg

Click "Browse", enter the package name in the search box (the console command below uses Spire.PDF), and click "Install":

https://img1.sycdn.imooc.com/616f79c3000103b909730481.jpg

Alternatively, install it from the Package Manager Console:

PM>Install-Package Spire.PDF -Version 7.10.4

 

Method 2: Add the reference manually.

Steps:

Right-click "References" and select "Add Reference":

https://img1.sycdn.imooc.com/616f79de0001fa2b03790429.jpg

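Click "Browse", then "Browse" again, and add the DLL file from its local path to the reference list (the package must be downloaded and unzipped beforehand):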

https://img1.sycdn.imooc.com/616f7a07000191be07660403.jpg

https://img4.sycdn.imooc.com/616f79f60001322113660728.jpg

Click OK to finish adding the reference:

https://img2.sycdn.imooc.com/616f7a240001a61d10060403.jpg


Code example

C#

using Spire.Pdf;
using Spire.Pdf.Utilities;
using System.IO;
using System.Text;

namespace ExtractTable
{
    class Program
    {
        static void Main(string[] args)
        {
            // Load the PDF document
            PdfDocument pdf = new PdfDocument();
            pdf.LoadFromFile("sample.pdf");
            StringBuilder builder = new StringBuilder();

            // Extract the tables page by page
            PdfTableExtractor extractor = new PdfTableExtractor(pdf);
            for (int pageIndex = 0; pageIndex < pdf.Pages.Count; pageIndex++)
            {
                PdfTable[] tableLists = extractor.ExtractTable(pageIndex);
                if (tableLists != null && tableLists.Length > 0)
                {
                    foreach (PdfTable table in tableLists)
                    {
                        int row = table.GetRowCount();
                        int column = table.GetColumnCount();
                        for (int i = 0; i < row; i++)
                        {
                            // Append the cells of one row, separated by spaces
                            for (int j = 0; j < column; j++)
                            {
                                string text = table.GetText(i, j);
                                builder.Append(text + " ");
                            }
                            // End the row with a line break
                            builder.Append("\r\n");
                        }
                    }
                }
            }

            // Save the extracted table text to a .txt file
            File.WriteAllText("ExtractedTable.txt", builder.ToString());
        }
    }
}


VB.NET

Imports Spire.Pdf
Imports Spire.Pdf.Utilities
Imports System.IO
Imports System.Text

Namespace ExtractTable
    Class Program
        Private Shared Sub Main(args As String())
            ' Load the PDF document
            Dim pdf As New PdfDocument()
            pdf.LoadFromFile("sample.pdf")
            Dim builder As New StringBuilder()

            ' Extract the tables page by page
            Dim extractor As New PdfTableExtractor(pdf)
            For pageIndex As Integer = 0 To pdf.Pages.Count - 1
                Dim tableLists As PdfTable() = extractor.ExtractTable(pageIndex)
                If tableLists IsNot Nothing AndAlso tableLists.Length > 0 Then
                    For Each table As PdfTable In tableLists
                        Dim row As Integer = table.GetRowCount()
                        Dim column As Integer = table.GetColumnCount()
                        For i As Integer = 0 To row - 1
                            ' Append the cells of one row, separated by spaces
                            For j As Integer = 0 To column - 1
                                Dim text As String = table.GetText(i, j)
                                builder.Append(text & " ")
                            Next
                            ' End the row with a line break
                            builder.Append(vbCrLf)
                        Next
                    Next
                End If
            Next

            ' Save the extracted table text to a .txt file
            File.WriteAllText("ExtractedTable.txt", builder.ToString())
        End Sub
    End Class
End Namespace

 

Result of the table extraction:

https://img3.sycdn.imooc.com/616f7a4c00016fdd12630569.jpg
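
The examples above separate cells with a single space, so a cell that itself contains spaces can no longer be told apart from a column boundary in the .txt output. One possible alternative, sketched here as an assumption rather than as part of the original example, is to write each row as a CSV line. The CsvSketch class and its EscapeCell and ToCsv helpers are hypothetical names. They use only the .NET base class library and take plain string arrays, so the values returned by table.GetText(i, j) would first have to be collected into such arrays.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;
using System.Text;

// Hypothetical helper class, not part of Spire.PDF.
static class CsvSketch
{
    // Quote a cell only when it contains a comma, a double quote, or a line break;
    // embedded double quotes are doubled.
    public static string EscapeCell(string cell)
    {
        string value = cell ?? string.Empty;
        bool needsQuotes = value.IndexOfAny(new[] { ',', '"', '\r', '\n' }) >= 0;
        return needsQuotes ? "\"" + value.Replace("\"", "\"\"") + "\"" : value;
    }

    // Join the cells of each row with commas and end every row with CRLF.
    public static string ToCsv(IEnumerable<string[]> rows)
    {
        var sb = new StringBuilder();
        foreach (string[] row in rows)
        {
            sb.Append(string.Join(",", row.Select(c => EscapeCell(c))));
            sb.Append("\r\n");
        }
        return sb.ToString();
    }

    static void Main()
    {
        // Sample data standing in for text collected from table cells.
        var rows = new List<string[]>
        {
            new[] { "Name", "Note" },
            new[] { "Widget A", "red, large" }
        };
        Console.Write(ToCsv(rows));
    }
}
```

The returned string could be passed to File.WriteAllText in the same way as builder.ToString() in the original code, with a .csv file name instead of .txt.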


Other notes:

The PDF file used in the code and the generated .txt file are located at F:\VS2017Project\ExtractTable\bin\Debug\sample.pdf and F:\VS2017Project\ExtractTable\bin\Debug\ExtractedTable.txt. Either path can be changed to a custom location.
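
The bare file names in the code are relative paths, which resolve against the process's current working directory. For a program started from Visual Studio with default settings, that is normally the bin\Debug folder. As a minimal sketch of pointing the program at other locations, the snippet below builds full paths with Path.Combine. The PathSketch class, the BuildPath helper, and the D:\Reports folder are hypothetical and serve only as illustration.

```csharp
using System;
using System.IO;

// Hypothetical helper class for illustration only.
static class PathSketch
{
    // Join a folder and a file name into a single path.
    public static string BuildPath(string folder, string fileName)
    {
        return Path.Combine(folder, fileName);
    }

    static void Main()
    {
        // Folder that contains the running .exe (bin\Debug for a default Debug build).
        string exeFolder = AppDomain.CurrentDomain.BaseDirectory;
        string inputPath = BuildPath(exeFolder, "sample.pdf");

        // Hypothetical custom output folder; replace it with a folder that exists.
        string outputPath = BuildPath(@"D:\Reports", "ExtractedTable.txt");

        Console.WriteLine(inputPath);
        Console.WriteLine(outputPath);
    }
}
```

These strings can then be passed to pdf.LoadFromFile and File.WriteAllText in place of "sample.pdf" and "ExtractedTable.txt".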

