How do I generate a recursive tree-like dictionary from a flat file (Gene Ontology OBO file)?

慕無忌1623718 2021-12-08 10:32:33
I am trying to write code that parses a Gene Ontology (GO) OBO file and pushes the GO term IDs (e.g. GO:0003824) into a tree-like nested dictionary. The hierarchy in the OBO file is expressed with the "is_a" identifier, which marks each parent of each GO term. A GO term may have multiple parents, and the highest GO terms in the hierarchy have no parent. A small example of a GO OBO file is shown below:

[Term]
id: GO:0003674
name: molecular_function
namespace: molecular_function
alt_id: GO:0005554
def: "A molecular process that can be carried out by the action of a single macromolecular machine, usually via direct physical interactions with other molecular entities. Function in this sense denotes an action, or activity, that a gene product (or a complex) performs. These actions are described from two distinct but related perspectives: (1) biochemical activity, and (2) role as a component in a larger system/process." [GOC:pdt]
comment: Note that, in addition to forming the root of the molecular function ontology, this term is recommended for use for the annotation of gene products whose molecular function is unknown. When this term is used for annotation, it indicates that no information was available about the molecular function of the gene product annotated as of the date the annotation was made; the evidence code "no data" (ND), is used to indicate this. Despite its name, this is not a type of 'function' in the sense typically defined by upper ontologies such as Basic Formal Ontology (BFO). It is instead a BFO:process carried out by a single gene product or complex.
subset: goslim_aspergillus
subset: goslim_candida
subset: goslim_chembl
subset: goslim_generic
subset: goslim_metagenomics
subset: goslim_pir
subset: goslim_plant
subset: goslim_yeast
synonym: "molecular function" EXACT []

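2 Answers

繁花不似錦

Contributed 1851 experience points, received more than 4 likes

You wrote:

if parent_go_id in parent_list:
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)

The correct version is:

if parent_go_id in parent_list:
    go_dict[parent_go_id][go_id] = generate_go_tree([go_term], all_go_terms, True)[go_id]

The recursive call presumably returns a dictionary that is itself keyed by go_id, so assigning it directly nests that key one level too deep; indexing with [go_id] keeps only the subtree. After this change, it produces:
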
{
    'GO:0003674': {
        'GO:0003824': {},
        'GO:0005198': {},
        'GO:0005488': {
            'GO:0005515': {},
            'GO:0005549': {
                'GO:0005550': {}
            }
        }
    }
}
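
To make the effect of the extra `[go_id]` concrete, here is a minimal sketch of the same pattern. The helper `subtree` and the `children_of` mapping are hypothetical stand-ins, not the asker's `generate_go_tree`; the only assumption carried over is that the recursive function returns a dictionary keyed by the ID it was called with.

```python
def subtree(term_id, children_of):
    """Return {term_id: {child_id: {...}}} for one term (hypothetical helper)."""
    node = {}
    for child_id in children_of.get(term_id, []):
        # subtree() returns a dict keyed by child_id, so index into it.
        # Assigning the whole return value would repeat the key:
        # {child_id: {child_id: {...}}}
        node[child_id] = subtree(child_id, children_of)[child_id]
    return {term_id: node}


# Illustrative parent -> children mapping using IDs from the example output
children_of = {
    'GO:0003674': ['GO:0003824', 'GO:0005488'],
    'GO:0005488': ['GO:0005515'],
}
fixed = subtree('GO:0003674', children_of)
print(fixed)
```

Dropping the trailing `[child_id]` in the loop reproduces the doubled-key nesting that the one-character fix removes.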

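But I would suggest a completely different approach: create a class that parses the terms and builds the dependency tree as it does so.

For convenience, I derive it from dict, so you can write term.id instead of term['id']:
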
class Term(dict):
    # Expose dictionary keys as attributes: term.id is term['id']
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__
    __delattr__ = dict.__delitem__

    registry = {}  # maps every id and alt_id to its Term
    # is_obsolete is listed here so that is_valid() can actually see it
    single_valued = 'id name namespace alt_id def comment synonym is_a is_obsolete'.split()
    multi_valued = 'subset xref'.split()

    def __init__(self, text):
        self.children = []
        self.parent = None

        for line in text.splitlines():
            if ': ' not in line:
                continue
            key, val = line.split(': ', 1)
            if key in Term.single_valued:
                # A repeated key (e.g. several is_a lines) keeps only the last value
                self[key] = val
            elif key in Term.multi_valued:
                self.setdefault(key, []).append(val)
            else:
                print('unclear property: %s' % line)

        if 'id' in self:
            Term.registry[self.id] = self
        if 'alt_id' in self:
            Term.registry[self.alt_id] = self

        if 'is_a' in self:
            # 'GO:0005488 ! binding' -> 'GO:0005488'
            key = self.is_a.split(' ! ', 1)[0]
            # Only links up if the parent was parsed earlier in the file
            if key in Term.registry:
                Term.registry[key].children.append(self)
                self.parent = Term.registry[key]

    def is_top(self):
        return self.parent is None

    def is_valid(self):
        # .get() avoids the KeyError that self.id raises for a chunk without an id
        return self.get('is_obsolete') != 'true' and self.get('id') is not None
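
One consequence of aliasing `__getattr__` to `dict.__getitem__` is worth seeing in isolation: a missing key raises `KeyError`, not `AttributeError`, on attribute access. The sketch below uses a hypothetical stand-in class, `AttrDict`, rather than `Term` itself, and only illustrates the behaviour of the trick.

```python
class AttrDict(dict):
    """Hypothetical stand-in for Term: dictionary keys double as attributes."""
    __getattr__ = dict.__getitem__
    __setattr__ = dict.__setitem__


term = AttrDict(id='GO:0005488')
term.name = 'binding'  # stored as term['name']

# Attribute access on a missing key raises KeyError (not AttributeError),
# so optional fields are safer to read with .get() or an `in` check.
try:
    term.is_obsolete
    missing_raises_key_error = False
except KeyError:
    missing_raises_key_error = True

obsolete = term.get('is_obsolete') == 'true'
print(term, missing_raises_key_error, obsolete)
```

Keys that collide with dict method names such as `items` or `get` are still reachable only through `term['items']`, because normal attribute lookup finds the method first.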

Now you can read the file in one go:

with open('tiny_go.obo', 'rt') as f:
    contents = f.read()

# Stanzas are separated by blank lines
terms = [Term(text) for text in contents.split('\n\n')]

Recursing over the tree then becomes easy. For example, a simple "print" function that outputs only non-obsolete nodes:

def print_tree(terms, indent=''):
    valid_terms = [term for term in terms if term.is_valid()]
    for term in valid_terms:
        print(indent + 'Term %s - %s' % (term.id, term.name))
        print_tree(term.children, indent + '  ')

top_terms = [term for term in terms if term.is_top()]

print_tree(top_terms)
This prints:

Term GO:0003674 - molecular_function
  Term GO:0003824 - catalytic activity
  Term GO:0005198 - structural molecule activity
  Term GO:0005488 - binding
    Term GO:0005515 - protein binding
    Term GO:0005549 - odorant binding
      Term GO:0005550 - pheromone binding

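You can also do things like Term.registry['GO:0005549'].parent.name, which gives "binding".

I leave generating nested dicts of GO IDs (as in your own example) as an exercise, but you may not even need them, because Term.registry is already very similar to that.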
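
One possible way to do that exercise is sketched below. It is not the answerer's solution. The function name `to_nested_dict` is made up for illustration, and the plain dicts are stand-ins that carry the same 'id' and 'children' keys a Term stores, so the same function should also accept a list of Term objects.

```python
def to_nested_dict(terms):
    """Map each term id to the nested dict of its children, recursively."""
    return {term['id']: to_nested_dict(term['children']) for term in terms}


# Stand-in nodes with the 'id' / 'children' keys that Term instances hold
pheromone = {'id': 'GO:0005550', 'children': []}
odorant = {'id': 'GO:0005549', 'children': [pheromone]}
binding = {'id': 'GO:0005488', 'children': [odorant]}
root = {'id': 'GO:0003674', 'children': [binding]}

nested = to_nested_dict([root])
print(nested)
```

With real Term objects you would pass the list of top-level terms, and you may want to filter out obsolete or id-less entries first, as print_tree does.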


2021-12-08
侃侃無極

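Contributed 2051 experience points, received more than 10 likes

You can use recursion for a shorter solution:

import itertools, json, re

with open('filename.txt') as f:
    content = [line.strip() for line in f if line.strip()]

# groupby alternates between '[Term]' marker lines and the stanza bodies
groups = itertools.groupby(content, key=lambda line: line == '[Term]')
stanzas = [list(body) for is_marker, body in groups if not is_marker]

def parse(stanza):
    term = dict(line.split(': ', 1) for line in stanza if ': ' in line)
    if 'is_a' in term:
        # 'GO:0005488 ! binding' -> 'GO:0005488'
        term['is_a'] = re.findall(r'^GO:\d+', term['is_a'])[0]
    return term

# Terms without 'is_a' (the root) sort first
terms = sorted(map(parse, stanzas), key=lambda t: 'is_a' in t)

def tree(d, _start):
    # Children of _start, each mapped to its own subtree ({} for a leaf)
    return {t['id']: tree(d, t['id']) for t in d if t.get('is_a') == _start}

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))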

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}

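This also works if a parent is not defined before its children. For example, when the parent is moved three positions below its original location, the same result is still produced (see the file):

print(json.dumps({terms[0]['id']: tree(terms, terms[0]['id'])}, indent=4))

Output:

{
    "GO:0003674": {
        "GO:0003824": {},
        "GO:0005198": {},
        "GO:0005488": {
            "GO:0005515": {},
            "GO:0005549": {
                "GO:0005550": {}
            }
        }
    }
}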
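
The question notes that a GO term may have several parents, whereas the code above keeps a single is_a value per term. A minimal sketch of one way to handle that case follows. The function `build_tree`, the `parents_of` input shape, and the single-letter IDs are all assumptions made for illustration. It first builds a parent-to-children index, which also avoids rescanning the whole term list at every level.

```python
from collections import defaultdict


def build_tree(parents_of):
    """parents_of maps each term id to a list of parent ids (empty for roots)."""
    # Invert the child -> parents mapping into parent -> children
    children_of = defaultdict(list)
    for term_id, parents in parents_of.items():
        for parent_id in parents:
            children_of[parent_id].append(term_id)

    def subtree(term_id):
        # A leaf has no recorded children and yields {}
        return {child: subtree(child) for child in children_of[term_id]}

    roots = [t for t, parents in parents_of.items() if not parents]
    return {root: subtree(root) for root in roots}


# Made-up IDs: D has two parents (B and C), listed before they are defined
parents_of = {'D': ['B', 'C'], 'B': ['A'], 'C': ['A'], 'A': []}
result = build_tree(parents_of)
print(result)
```

A term with two parents appears under both of them in the result. The recursion assumes the is_a relations contain no cycles.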

