
How do I generate a recursive tree-like dictionary from a flat file (Gene Ontology OBO file)?

Python

慕無忌1623718 2021-12-08 10:32:33

I am trying to write code that parses a Gene Ontology (GO) OBO file and pushes the GO term IDs (e.g. GO:0003824) into a tree-like nested dictionary. The hierarchy in the OBO file is expressed with the "is_a" identifier, which marks each parent of each GO term. A GO term may have multiple parents, and the topmost GO terms in the hierarchy have no parent. A small example of a GO OBO file is shown below:

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function
alt_id: GO:0005554
def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_generic
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
synonym: "molecular function" EXACT []


2 Answers

繁花不似錦

Contributed 1851 experience points · received more than 4 likes

You wrote

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

The correct version is

if (parent_go_id in parent_list):
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]

After this change, it produces:

{
  'GO:0003674': {
    'GO:0003824': {},
    'GO:0005198': {},
    'GO:0005488': {
      'GO:0005515': {},
      'GO:0005549': {
        'GO:0005550': {}
      }
    }
  }
}
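Why the trailing [go_id] matters can be seen in a reduced example. The sketch below does not reproduce generate_go_tree (its full body is not shown here); it uses a hypothetical build function and a made-up CHILDREN table, and assumes only that the recursive call returns a dict keyed by the term's own ID:

```python
# Hypothetical stand-in for generate_go_tree: returns {term_id: subtree}.
# CHILDREN is illustrative data, not parsed from a real OBO file.
CHILDREN = {'GO:0005488': ['GO:0005515'], 'GO:0005515': []}

def build(term_id, unwrap):
    subtree = {}
    for child in CHILDREN[term_id]:
        result = build(child, unwrap)  # result has the shape {child: {...}}
        # Without unwrapping, the child's ID ends up nested inside itself.
        subtree[child] = result[child] if unwrap else result
    return {term_id: subtree}

wrong = build('GO:0005488', unwrap=False)
right = build('GO:0005488', unwrap=True)
print(wrong)
print(right)
```

In the unwrapped variant each ID appears once per level; in the other, every child key contains a second copy of itself.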

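But I would suggest a completely different approach. Create a class that parses the terms and builds the dependency tree as it does so.

For convenience, I derive it from dict, so you can write term.id instead of term['id']: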

class Term(dict):
    # Expose dict keys as attributes, so term.id works like term['id'].
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    registry = {}  # maps every id and alt_id to its Term

    # is_obsolete is listed here so that is_valid() can actually see it.
    single_valued = 'id name namespace alt_id def comment synonym is_a is_obsolete'.split()
    multi_valued = 'subset xref'.split()

    def __init__(self, text):
        self.children = []
        self.parent = None
        for line in text.splitlines():
            if ': ' not in line:
                continue
            key, val = line.split(': ', 1)
            if key in Term.single_valued:
                self[key] = val
            elif key in Term.multi_valued:
                if key not in self:
                    self[key] = [val]
                else:
                    self[key].append(val)
            else:
                print('unclear property: %s' % line)
        if 'id' in self:
            Term.registry[self.id] = self
        if 'alt_id' in self:
            Term.registry[self.alt_id] = self
        if 'is_a' in self:
            # 'GO:0003674 ! molecular_function' -> 'GO:0003674'
            key = self.is_a.split(' ! ', 1)[0]
            # The parent is only linked if it was parsed earlier in the file.
            if key in Term.registry:
                Term.registry[key].children.append(self)
                self.parent = Term.registry[key]

    def is_top(self):
        return self.parent is None

    def is_valid(self):
        # Use get() so that chunks without an id (e.g. a header) are rejected
        # instead of raising KeyError.
        return self.get('is_obsolete') != 'true' and self.get('id') is not None

Now you can read the file in one go:

with open('tiny_go.obo', 'rt') as f:
    contents = f.read()

terms = [Term(text) for text in contents.split('\n\n')]

The recursive tree then becomes easy. For example, a simple "print" function that outputs only non-obsolete nodes:

def print_tree(terms, indent=''):
    valid_terms = [term for term in terms if term.is_valid()]
    for term in valid_terms:
        print(indent + 'Term %s - %s' % (term.id, term.name))
        print_tree(term.children, indent + ' ')

top_terms = [term for term in terms if term.is_top()]
print_tree(top_terms)

This prints:

Term GO:0003674 - molecular_function
 Term GO:0003824 - catalytic activity
 Term GO:0005198 - structural molecule activity
 Term GO:0005488 - binding
  Term GO:0005515 - protein binding
  Term GO:0005549 - odorant binding
   Term GO:0005550 - pheromone binding

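You can also do something like Term.registry['GO:0005549'].parent.name, which yields "binding".

I leave generating nested dicts of GO IDs (as in your own example) as an exercise, but you may not even need it, since Term.registry is already quite similar to that.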
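For readers who do want the nested dicts, one possible sketch follows. It assumes objects that carry an id and a children list, which is what the Term class provides; the Node class and the to_nested name are hypothetical stand-ins so that the snippet runs on its own:

```python
class Node:
    """Hypothetical stand-in for Term: just an id and a list of children."""
    def __init__(self, id, children=()):
        self.id = id
        self.children = list(children)

def to_nested(nodes):
    """Return {id: {child_id: {...}}} for the given nodes, recursively."""
    return {node.id: to_nested(node.children) for node in nodes}

# Rebuild the small hierarchy from the example by hand.
pheromone = Node('GO:0005550')
odorant = Node('GO:0005549', [pheromone])
binding = Node('GO:0005488', [Node('GO:0005515'), odorant])
root = Node('GO:0003674', [Node('GO:0003824'), Node('GO:0005198'), binding])

nested = to_nested([root])
print(nested)
```

With the Term objects, a call such as to_nested(top_terms) should give the same shape, provided every element in the list has an id.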

Replied 2021-12-08

侃侃無極

Contributed 2051 experience points · received more than 10 likes

You can use recursion for a shorter solution:

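import itertools, re, json

# Read the non-empty lines of the OBO file.
with open('filename.txt') as f:
    content = list(filter(None, [i.strip('\n') for i in f]))

# Group consecutive lines, splitting on the '[Term]' stanza markers.
entries = [[a, list(b)] for a, b in itertools.groupby(content, key=lambda x: x == '[Term]')]

# One dict per stanza; 'is_a' is reduced to the bare parent GO ID.
terms = [(lambda x: x if 'is_a' not in x else {**x, 'is_a': re.findall(r'^GO:\d+', x['is_a'])[0]})(dict(i.split(': ', 1) for i in b)) for a, b in entries if not a]

# Put the root term (the one without 'is_a') first.
terms = sorted(terms, key=lambda x: 'is_a' in x)

def tree(d, _start):
    # Collect the direct children of _start, then recurse into each of them.
    t = [i for i in d if i.get('is_a') == _start]
    return {} if not t else {i['id']: tree(d, i['id']) for i in t}

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))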

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}

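This also works if a parent is not defined before its children. For example, when the parent is moved three positions below its original place, the same result is still produced (see the file):

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))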

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}
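The question mentions that a GO term can have several is_a parents, whereas building each stanza with dict(...) keeps only the last is_a line. A possible variant is sketched below. It assumes stanzas are separated by blank lines, and the helper names parse_parents and build_tree as well as the sample IDs are made up for illustration. It collects every parent and indexes children by parent, so each lookup is a dict access rather than a scan over all terms:

```python
import re
from collections import defaultdict

def parse_parents(text):
    """Map each term ID to the list of its is_a parent IDs."""
    parents = {}
    for stanza in text.split('\n\n'):
        lines = stanza.strip().splitlines()
        if not lines or lines[0] != '[Term]':
            continue  # skip the header and non-term stanzas
        term_id, found = None, []
        for line in lines[1:]:
            if line.startswith('id: '):
                term_id = line[len('id: '):].strip()
            elif line.startswith('is_a: '):
                match = re.match(r'is_a: (GO:\d+)', line)
                if match:
                    found.append(match.group(1))
        if term_id is not None:
            parents[term_id] = found
    return parents

def build_tree(parents):
    """Nest every term under each of its parents; roots have no parents."""
    children = defaultdict(list)
    for term_id, found in parents.items():
        for parent in found:
            children[parent].append(term_id)

    def subtree(term_id):
        return {child: subtree(child) for child in children[term_id]}

    return {t: subtree(t) for t, found in parents.items() if not found}

# Made-up sample: the IDs and names are placeholders, not real GO relations.
sample = """[Term]
id: GO:0000001
name: root

[Term]
id: GO:0000002
is_a: GO:0000001 ! root

[Term]
id: GO:0000003
is_a: GO:0000001 ! root

[Term]
id: GO:0000004
is_a: GO:0000002 ! a
is_a: GO:0000003 ! b
"""
tree_dict = build_tree(parse_parents(sample))
print(tree_dict)
```

A term with two parents appears under both of them. The recursion relies on the is_a relation being acyclic, which holds for a well-formed ontology but is not checked here.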

Replied 2021-12-08
