3 Answers

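Contributed 1817 experience points · earned more than 14 likes
You do not need to delete anything. In fact, you never want to modify strings.
Strings are immutable: every time you "modify" a string, you create a new string and throw the old one away. That is a waste of processor time and memory.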
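As a quick illustration of that point, here is a minimal sketch (the variable names are made up for this example): item assignment on a string fails, and a "modifying" method such as `str.replace` hands back a new string while the original stays untouched.

```python
s = "<hi> hey"

# In-place modification is not possible: item assignment raises TypeError.
try:
    s[0] = "["
    assignment_failed = False
except TypeError:
    assignment_failed = True

# "Modifying" methods return a new string object; the original is unchanged.
t = s.replace("<hi>", "")

print(assignment_failed)
print(repr(s), repr(t))
```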
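You are operating on a file, so process it character by character:
- Remember whether you are inside <...> or not.
- If you are inside, the only character that matters is >, which takes you outside again.
- If you are outside and the character is <, you go inside and ignore that character.
- If you are outside and the character is not <, you write the character to the output (file).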
# create file
with open("somefile.txt", "w") as f:
    # up the multiplicator to 10000000 to create something in the megabyte range
    f.write("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)

# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    # we iterate the file line by line
    for line in f:
        # and each line characterwise
        for c in line:
            if not inside and c == "<":
                inside = True
            elif inside and c != ">":
                continue
            elif inside and c == ">":
                inside = False
            elif not inside:
                # only case to write to out
                out.write(c)

print(open("somefile.txt").read() + "\n")
print(open("otherfile.txt").read())
Output:
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
hello hello hey tata
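The line-by-line loop above assumes the lines are reasonably short. As a hedged variation, the same inside/outside rule can be applied to fixed-size chunks, which also copes with a file that is one huge line. The function name `strip_tags_stream` and the `chunk_size` parameter are assumptions made for this sketch, and `io.StringIO` stands in for real files so that the example is self-contained.

```python
import io


def strip_tags_stream(src, dst, chunk_size=64 * 1024):
    """Copy text from src to dst, dropping everything between < and >."""
    inside = False  # the state survives chunk boundaries
    while True:
        chunk = src.read(chunk_size)
        if not chunk:
            break
        kept = []
        for c in chunk:
            if inside:
                if c == ">":
                    inside = False
            elif c == "<":
                inside = True
            else:
                kept.append(c)
        # one write per chunk instead of one write per character
        dst.write("".join(kept))


src = io.StringIO("<hi> hey <bye> tata\n" * 3)
dst = io.StringIO()
# a tiny chunk size on purpose, so that tags get split across chunks
strip_tags_stream(src, dst, chunk_size=7)
print(repr(dst.getvalue()))
```

With real files, the two `StringIO` objects would be replaced by the handles returned from `open(...)`.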
If you are not allowed to operate on files directly, read the file into a list, which consumes 11+ MB of memory:
data = list("<script beep beep> hello </script boop doop woop> hello <hi> hey <bye> tata\n" * 10)
result = []
inside = False
for c in data:
    if inside:
        if c == ">":
            inside = False
        # else ignore c - because we are inside
    elif c == "<":
        inside = True
    else:
        result.append(c)
print(''.join(result))
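This is still better than repeatedly searching the list for the first occurrence of "<", but it may need up to twice the memory of the source (if the source contains no <..> at all, you end up with a second list of the same size).
Operating on the file is far more memory efficient than doing any in-place list modification (which would be the third way to do it).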
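For comparison, the in-place list modification mentioned as the third way could look roughly like the sketch below. It is not taken from the answer: the function name `strip_tags_in_place` is hypothetical, and the write-index technique is just one way to compact a list without building a second one.

```python
def strip_tags_in_place(chars):
    """Remove <...> runs from a list of characters without a second list."""
    write = 0       # next position to keep a character at (never ahead of the read position)
    inside = False
    for c in chars:
        if inside:
            if c == ">":
                inside = False
        elif c == "<":
            inside = True
        else:
            chars[write] = c
            write += 1
    # cut off the leftover tail in a single operation
    del chars[write:]
    return chars


data = list("<hi> hey <bye> tata")
strip_tags_in_place(data)
print(repr("".join(data)))
```

The list still has to hold the whole source in memory, which is why the file-based loop remains the more economical option.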
You would also need to resolve some obvious problems, for example:
<script type="text/javascript">
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
which would leave the "code" inside.
This might solve the easier corner cases:
# open file to read from and file to write to
with open("somefile.txt") as f, open("otherfile.txt", "w") as out:
    # starting outside
    inside = False
    insideJS = False
    jsStart = 0
    # we iterate the file line by line
    for line in f:
        # string manipulation :/ - will remove <script ...> .. </script ..>
        # even over multiple lines - probably missed some corner cases.
        while True:
            if insideJS:
                # the script block continues from a previous line
                jsStart = 0
            elif "<script" in line:
                insideJS = True
                jsStart = line.index("<script")
            else:
                # no script content left in this line
                break
            jsClose = line.find("</script", jsStart)
            if jsClose == -1:
                # no closing tag on this line: keep only the text before the script
                line = line[:jsStart]
                break
            jsEnd = line.index(">", jsClose) + 1
            line = line[:jsStart] + line[jsEnd:]
            insideJS = False
        # and each line characterwise, same as above
        for c in line:
            if inside:
                if c == ">":
                    inside = False
            elif c == "<":
                inside = True
            else:
                out.write(c)

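Contributed 1841 experience points · earned more than 3 likes
Even with two while loops, this is still linear complexity, because the index i only ever moves forward: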
string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
new_string = ''
i = 0
while i < len(string):
    if string[i] == "<":
        # skip ahead to the closing '>' (or to the end if the tag is never closed)
        while i < len(string) and string[i] != '>':
            i += 1
    else:
        new_string += string[i]
    i += 1
print(new_string)
Output:
hello hello hey
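The same index-based idea can be sketched with `str.find`, which jumps from one delimiter to the next and collects the pieces in a list that is joined once at the end. This is an alternative sketch, not the answer's code; the function name `strip_tags_find` is made up, and dropping the rest of the text after an unterminated `<` is an assumption of this example.

```python
def strip_tags_find(text):
    """Return text with every <...> run removed, using str.find to skip ahead."""
    parts = []
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            # no more tags: keep the remainder and stop
            parts.append(text[pos:])
            break
        parts.append(text[pos:start])
        end = text.find(">", start)
        if end == -1:
            # unterminated tag: drop the rest (assumption of this sketch)
            break
        pos = end + 1
    return "".join(parts)


string = "<script beep beep> hello </script boop doop woop> hello <hi> hey <bye>"
print(repr(strip_tags_find(string)))
```

Each character is still visited a bounded number of times, so the running time stays linear in the length of the input.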

Contributed 1856 experience points · earned more than 11 likes
Here is an approach using an FSA (finite-state automaton):
output = ''
NORMAL, INSIDE_TAG = range(2)  # available states
state = NORMAL  # start with normal state
s = '<script beep beep> hello </script boop doop woop> hello <hi id="someid" class="some class"><a> hey </a><bye>'
for char in s:
    if char == '<':  # if we encounter '<' we enter the INSIDE_TAG state
        state = INSIDE_TAG
        continue
    elif char == '>':  # we can safely exit the INSIDE_TAG state
        state = NORMAL
        continue
    if state == NORMAL:
        output += char  # add the char to the output only if we are in normal state
print(output)
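If you need to parse tag semantics, make sure you use a stack (which can be implemented with a list).
This adds complexity, but with an FSM you can implement robust checks.
See the following example: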
output = ''
(
    NORMAL,
    TAG_ATTRIBUTE,
    INSIDE_JAVASCRIPT,
    EXITING_TAG,
    BEFORE_TAG_OPENING_OR_ENDING,
    TAG_NAME,
    ABOUT_TO_EXIT_JS
) = range(7)  # available states
state = NORMAL  # start with normal state
tag_name = ''
s = """
<script type="text/javascript">
  var i = 10;
  if (i < 5) {
    // some code
  }
</script>
<sometag>
  test string
  <a > another string</a>
</sometag>
"""
for char in s:
    # print(char, '-', state, ':', tag_name)
    if state == NORMAL:
        if char == '<':
            # forget the previous tag name so each tag is compared on its own
            tag_name = ''
            state = BEFORE_TAG_OPENING_OR_ENDING
        else:
            output += char
    elif state == BEFORE_TAG_OPENING_OR_ENDING:
        if char == '/':
            state = EXITING_TAG
        else:
            tag_name += char
            state = TAG_NAME
    elif state == TAG_ATTRIBUTE:
        if char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
    elif state == TAG_NAME:
        if char == ' ':
            state = TAG_ATTRIBUTE
        elif char == '>':
            if tag_name == 'script':
                state = INSIDE_JAVASCRIPT
            else:
                state = NORMAL
        else:
            tag_name += char
    elif state == INSIDE_JAVASCRIPT:
        if char == '<':
            state = ABOUT_TO_EXIT_JS
        else:
            pass
            # output += char
    elif state == ABOUT_TO_EXIT_JS:
        if char == '/':
            state = EXITING_TAG
            tag_name = ''
        else:
            # output += '<'
            state = INSIDE_JAVASCRIPT
    elif state == EXITING_TAG:
        if char == '>':
            state = NORMAL
print(output)
Output:
  test string
   another string
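The stack idea mentioned in this answer can be sketched as follows: a plain list holds the names of the tags that are currently open, and every closing tag must match the name on top. This is an illustration only, not the answer's code. The function name `tags_balanced` is hypothetical, and the sketch assumes there are no script blocks, comments, or self-closing tags in the input.

```python
def tags_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []  # names of the tags that are currently open
    pos = 0
    while True:
        start = text.find("<", pos)
        if start == -1:
            # end of input: balanced only if nothing is left open
            return not stack
        end = text.find(">", start)
        if end == -1:
            return False  # unterminated tag
        words = text[start + 1:end].split()
        name = words[0] if words else ""
        if name.startswith("/"):
            # closing tag: it has to match the most recently opened tag
            if not stack or stack.pop() != name[1:]:
                return False
        else:
            stack.append(name)
        pos = end + 1


print(tags_balanced("<sometag> test string <a > another string</a></sometag>"))
print(tags_balanced("<a><b></a></b>"))
```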