
How to remove single-character entries from a list in Python when pdfminer cannot convert the rupee symbol font

Python

米琪卡哇伊 2023-06-27 17:37:21

I am extracting text from a PDF and converting it to HTML, then pulling the text out of that HTML with BeautifulSoup. I am running into a problem with symbols such as the currency (rupee) symbol: it comes out looking like a tilde (``). The extracted list of [text, font size] pairs looks like this:

[['Amid', '41'], ['``', '41'], ['3L cr shortfall, GST cess to continue beyond June 2022', '41'], ['Cong clips wings of ‘letter writers’ in new appointments ', '32'], ['MVA aims to cut guv’s power to choose VC ', '28']]

Current output:

1. Amid
2. 3L cr shortfall, GST cess to continue beyond June 2022
3. Cong clips wings of ‘letter writers’ in new appointments
4. MVA aims to cut guv’s power to choose VC

I want to output the text that has the larger font sizes, and I also want to remove single-character entries from the list, such as ['``', '41']. My desired output should look like this:

1. Amid 3L cr shortfall, GST cess to continue beyond June 2022
2. Cong clips wings of ‘letter writers’ in new appointments
3. Cong clips wings of ‘letter writers’ in new appointments

My full code:

import sys, os, re, operator, tempfile, fileinput
from bs4 import BeautifulSoup, Tag, UnicodeDammit
from io import StringIO
from pdfminer.layout import LAParams
from pdfminer.high_level import extract_text_to_fp

def convert_html(filename):
    output = StringIO()
    with open(filename, 'rb') as fin:
        extract_text_to_fp(fin, output, laparams=LAParams(), output_type='html', codec=None)
    Out_txt = output.getvalue()
    return Out_txt

def get_the_start_of_font(x, attr):
    """ Return the index of the 'font-size' first occurrence or None. """
    match = re.search(x, attr)
    if match is not None:
        return match.start()
    return None

def get_font_size_from(attr):
    """ Return the font size as string or None if not found. """
    font_start_i = get_the_start_of_font('font-size:', attr)
    if font_start_i is not None:
        font_size = str(attr[font_start_i + len('font-size:'):].split('px')[0])
        if int(font_size) > 25:
            return font_size


2 Answers

湖上湖

Contributed 2003 experience points, received 2+ likes

headlines = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
             ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k ', '28'],
             ['O', '33'],
             ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]

for idx, line in enumerate(sorted([row for row in headlines if len(row[0]) > 1], key=lambda z: int(z[1]), reverse=True)):
    print("{}. {}".format(idx+1, line[0]))

Output:

1. In bid to boost realty, state cuts stamp duty for 7 mths

2. India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k

3. Don’t hide behind RBI on loan interest waiver: SC to govt

A breakdown of what happens above:

[row for row in headlines if len(row[0]) > 1]

This creates a new list containing every entry of headlines for which the length of entry_in_headlines[0] is greater than 1.

sorted(<iterable>, key=lambda z: int(z[1]), reverse=True)

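This sorts the given iterable using a lambda function that takes one argument and returns the element at index 1 of that argument (the font size) as an integer. Because of reverse=True, the result is in descending order.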

for idx, line in enumerate(<iterable>):

Looping over enumerate yields a running count (starting at 0) together with the next value from the iterable.

print("{}. {}".format(idx+1, line[0]))

Using string formatting, we build the new string inside the for loop.
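The steps above can also be wrapped in a small helper. This is only a sketch: the name rank_headlines and the min_chars parameter are made up for illustration, and unlike the one-liner it strips surrounding whitespace before measuring the length.

```python
def rank_headlines(rows, min_chars=2):
    """Return headline texts, largest font size first, skipping very short entries."""
    # Keep rows whose text (ignoring surrounding whitespace) is long enough.
    kept = [row for row in rows if len(row[0].strip()) >= min_chars]
    # sorted() is stable, so rows with equal font sizes keep their original order.
    ordered = sorted(kept, key=lambda row: int(row[1]), reverse=True)
    return [row[0].strip() for row in ordered]


headlines = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
             ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k ', '28'],
             ['O', '33'],
             ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]

# start=1 makes enumerate count from 1, so no idx+1 is needed.
for number, text in enumerate(rank_headlines(headlines), start=1):
    print(f"{number}. {text}")
```

Passing start=1 to enumerate and using an f-string are stylistic choices; the behaviour is otherwise the same filter, sort, and number sequence described above.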

Answered 2023-06-27

呼如林

Contributed 1798 experience points, received 3+ likes

I can't really figure out what you are trying to do or where your data comes from, but you need to add an if statement.

For example:

data = ['In bid to boost realty, state cuts stamp duty for 7 mths ', '42']

# Print the text only if it contains more than two words
if len(data[0].split()) > 2:
    print(data[0])

Any statement of 2 words or fewer will not be printed.

If you have a list of lists:

data = [['In bid to boost realty, state cuts stamp duty for 7 mths ', '42'],
        ['India sees world’s third-biggest spike of 76,000+ cases, toll crosses 60k', '28'],
        ['O', '33'],
        ['Don’t hide behind RBI on loan interest waiver: SC to govt ', '28']]

# Build a new list rather than calling data.remove() inside the loop:
# removing items from a list while iterating over it can skip elements.
data = [row for row in data if len(row[0].split()) > 2]

print(*(row[0] for row in data), sep='\n')
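The desired output in the question also joins 'Amid' with the fragment that follows it, since both have font size 41 and are separated only by the unconverted rupee symbol. A rough sketch of that idea is below. It assumes that a fragment with no letters or digits is junk and that adjacent fragments with the same font size belong to one headline; merge_fragments is a hypothetical name, and the rows are the question's data restored to English from its listed output. It filters with a list comprehension, so the list is never modified while it is being looped over.

```python
from itertools import groupby


def merge_fragments(rows):
    """Drop symbol-only fragments, then join neighbours that share a font size."""
    # Assumed junk test: keep only fragments containing a letter or digit.
    kept = [(text.strip(), size) for text, size in rows
            if any(ch.isalnum() for ch in text)]
    merged = []
    # groupby groups only *consecutive* items with the same key (font size).
    for size, group in groupby(kept, key=lambda item: item[1]):
        merged.append([" ".join(text for text, _ in group), size])
    return merged


rows = [['Amid', '41'], ['``', '41'],
        ['3L cr shortfall, GST cess to continue beyond June 2022', '41'],
        ['Cong clips wings of ‘letter writers’ in new appointments ', '32'],
        ['MVA aims to cut guv’s power to choose VC ', '28']]

for number, (text, size) in enumerate(merge_fragments(rows), start=1):
    print(f"{number}. {text}")
```

With this data the first line printed is "1. Amid 3L cr shortfall, GST cess to continue beyond June 2022". Two unrelated headlines that happen to share a font size and sit next to each other would also be merged, so the grouping rule may need tightening for real pages.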

Answered 2023-06-27
