
How to remove JavaScript and other tags with Python... without importing modules

Python

神不在的星期二 2023-09-19 14:15:07

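For the first part of a school project, I am trying to figure out how to remove the JavaScript <script {...}> and </script {...}> tags, along with anything between < and >. However, we cannot import any modules (not even Python's built-in ones), because apparently the markers may not have access to them. I tried this:

text = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
while text.find("<script") >= 0:
    script_start = text.find("<script")
    script_end = text.find(">", text.find("</script")) + 1
    text = text[:script_start] + text[script_end:]
while text.find("<") >= 0:
    script2_start = text.find("<")
    script2_end = text.find(">") + 1
    text = text[:script2_start] + text[script2_end:]

This does work for smaller files, but the project involves large text files (the simplified test file we were given is 10.4 MB), so it never finishes and just hangs. Does anyone have any ideas for making it more efficient?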


3 Answers

大話西游666

Contributed 1817 experience points, received more than 14 upvotes

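You don't need to delete anything. In fact, you never want to modify the string.

Strings are immutable: every time you "modify" a string, you create a new string and discard the old one. That wastes both processor time and memory.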
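To make the cost concrete: each `text[:a] + text[b:]` in the question copies nearly the whole string, and each `text.find(...)` restarts from index 0. One possible way to avoid both, sketched here under the assumption that only `str.find` and `str.join` are allowed (both are built-in string methods, so nothing is imported), is to scan forward once and collect the slices to keep. The function name `remove_script_blocks` is made up for this illustration, and treating an unterminated script block as running to the end of the text is an assumption.

```python
def remove_script_blocks(text):
    """Drop every <script ...> ... </script ...> block in one forward scan."""
    pieces = []  # slices to keep; joined once at the end
    pos = 0      # everything before pos has already been handled
    while True:
        start = text.find("<script", pos)  # search resumes at pos, not at 0
        if start < 0:
            pieces.append(text[pos:])      # no more script blocks
            break
        pieces.append(text[pos:start])     # keep the text before the block
        close = text.find("</script", start)
        end = text.find(">", close) if close >= 0 else -1
        if end < 0:
            # Assumption: an unterminated block swallows the rest of the text.
            break
        pos = end + 1
    return "".join(pieces)


sample = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
print(repr(remove_script_blocks(sample)))  # → ' hello <hi> hey <bye>'
```

The remaining `<hi>` and `<bye>` tags are left for a second pass such as the character-wise loop described next.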

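You are working with a file, so process it character by character:

Remember whether you are inside <...>
If you are, the only character that matters is >, which takes you back outside
If you are outside and the character is <, you go inside and ignore the character
If you are outside and the character is not <, write the character to the output (file)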

# create file
with open("somefile.txt", "w") as f:
    # each line is 76 bytes; raise the multiplier (e.g. to 100000)
    # to create something in the megabyte range
    f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    # we iterate the file line by line
    for line in f:
        # and each line characterwise
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                # only case to write to out
                out.write(c)

# show the input followed by the stripped output
with open("somefile.txt") as f:
    print(f.read() + "\n")
with open("otherfile.txt") as f:
    print(f.read())

Output:

hello hello hey tata

If you are not allowed to operate on files directly, read the file into a list, which takes at least 11 MB of memory:

data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)
result = []
inside = False
for c in data:
    if inside:
        if c == ">":
            inside = False
        # else ignore c - because we are inside
    elif c == "<":
        inside = True
    else:
        result.append(c)
print(''.join(result))

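This is still better than repeatedly searching the list for the first occurrence of "<", but it may need up to twice the memory of the source (if the source contains no <..> at all, the list is effectively duplicated).

Operating on files is far more memory-efficient than modifying a list in place (which would be a third approach).

You will also need to handle some obvious problem cases, for example

<script>
var i = 10;
if (i < 5) {
    // some code
}
</script>

which would leave part of the code in the output.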
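A short demonstration of that failure, written as a sketch: `strip_angle_brackets` is a hypothetical name for the same inside/outside loop shown earlier, and the one-line `page` string is a condensed stand-in for the JavaScript example above.

```python
def strip_angle_brackets(text):
    """Naive stripper: drops everything between '<' and the next '>'."""
    kept = []
    inside = False
    for c in text:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            kept.append(c)
    return "".join(kept)


page = "<script>var i = 10; if (i < 5) { /* some code */ }</script>after"
print(repr(strip_angle_brackets(page)))  # → 'var i = 10; if (i after'
```

The `<` of the comparison `i < 5` is mistaken for the start of a tag, so the script text before it leaks into the output while everything up to the `>` of `</script>` is swallowed.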

This might handle the simpler corner cases:

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    insideJS = False
    jsStart = 0
    # we iterate the file line by line
    for line in f:
        # string manipulation :/ - will remove <script ...> .. </script ..>
        # even over multiple lines - probably missed some cornercases.
        while True:
            if insideJS and "</script" not in line:
                # still inside a script block that started on an earlier line
                line = ""
                break
            if "<script" in line:
                insideJS = True
                jsStart = line.index("<script")
                jsEnd = len(line)
            elif insideJS:
                jsStart = 0
            if not insideJS:
                break
            if "</script" in line:
                jsEnd = line.index(">", line.index("</script", jsStart)) + 1
                line = line[:jsStart] + line[jsEnd:]
                insideJS = False
            else:
                # script continues on the next line: keep the text before it.
                # Without this break the next pass would blank the whole line.
                line = line[:jsStart]
                break
        # and each line characterwise, same as above
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                out.write(c)

2023-09-19

偶然的你

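Contributed 1841 experience points, received more than 3 upvotes

Even with two nested while loops the scan is still linear: the index i only ever moves forward, so each character is visited once.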

string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
new_string = ''
i = 0
while i < len(string):
    if string[i] == "<":
        # skip ahead to the closing '>'; the bounds check stops at the end
        # of the string instead of raising IndexError on an unclosed tag
        while i < len(string) and string[i] != '>':
            i += 1
    else:
        new_string += string[i]
    i += 1
print(new_string)

Output:

hello hello hey

2023-09-19

呼喚遠方

Contributed 1856 experience points, received more than 11 upvotes

Here is one approach using an FSA (finite-state automaton):

output = ''
NORMAL, INSIDE_TAG = range(2)  # available states
state = NORMAL  # start with normal state
s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'
for char in s:
    if char == '<':  # if we encounter '<' we enter the INSIDE_TAG state
        state = INSIDE_TAG
        continue
    elif char == '>':  # we can safely exit the INSIDE_TAG state
        state = NORMAL
        continue
    if state == NORMAL:
        output += char  # add the char to the output only if we are in normal state
print(output)

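If you need to parse tag semantics, make sure to use a stack (which can be implemented with a list).

This adds complexity, but with an FSM you can implement reliable checks.

See the following example: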

output = ''
(
    NORMAL,
    TAG_ATTRIBUTE,
    INSIDE_JAVASCRIPT,
    EXITING_TAG,
    BEFORE_TAG_OPENING_OR_ENDING,
    TAG_NAME,
    ABOUT_TO_EXIT_JS
) = range(7)  # available states
state = NORMAL  # start with normal state
tag_name = ''
s = """
<script>
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
  test string
  <a > another string</a>
</sometag>
"""
for char in s:
    # print(char, '-', state, ':', tag_name)
    if state == NORMAL:
        if char == '<':
            state = BEFORE_TAG_OPENING_OR_ENDING
        else:
            output += char
    elif state == BEFORE_TAG_OPENING_OR_ENDING:
        if char == '/':
            state = EXITING_TAG
        else:
            # start a fresh tag name so the previous tag's name does not leak in
            tag_name = char
            state = TAG_NAME
    elif state == TAG_ATTRIBUTE:
        if char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
    elif state == TAG_NAME:
        if char == ' ':
            state = TAG_ATTRIBUTE
        elif char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
        else:
            tag_name += char
    elif state == INSIDE_JAVASCRIPT:
        if char == '<':
            # might be the start of </script>, or just a comparison like i < 5
            state = ABOUT_TO_EXIT_JS
        else:
            pass
            # output += char
    elif state == ABOUT_TO_EXIT_JS:
        if char == '/':
            state = EXITING_TAG
            tag_name = ''
        else:
            # output += '<'
            state = INSIDE_JAVASCRIPT
    elif state == EXITING_TAG:
        if char == '>':
            state = NORMAL
print(output)

Output:

  test string
   another string
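The stack suggestion above could look roughly like the following. This is only a sketch: the name `tags_are_balanced` is invented for illustration, and it assumes every tag has a matching closing tag (no void elements such as `<br>`), that attribute values never contain `>`, and that script contents have already been removed. A plain list serves as the stack, so nothing is imported.

```python
def tags_are_balanced(text):
    """Return True if every opening tag is closed in the reverse order."""
    stack = []  # names of the tags that are currently open
    pos = 0
    while True:
        lt = text.find("<", pos)
        if lt < 0:
            break                   # no more tags
        gt = text.find(">", lt)
        if gt < 0:
            return False            # '<' without a closing '>'
        body = text[lt + 1:gt].strip()
        pos = gt + 1
        closing = body.startswith("/")
        words = body.lstrip("/").split()
        name = words[0] if words else ""
        if closing:
            # a closing tag must match the most recently opened tag
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack                # anything left open is an error


print(tags_are_balanced("<p><a href='x'>hi</a></p>"))  # → True
print(tags_are_balanced("<p><a>hi</p></a>"))           # → False
```

`list.append` and `list.pop` give the push and pop operations, which is why a list is enough here.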

2023-09-19
