已解決430363個問題，去搜搜看，總會有你想問的

Snakemake 從輸入文件名派生多個變量

首頁猿問 Snakemake...

Snakemake 從輸入文件名派生多個變量

Python

慕斯王 2023-10-31 14:19:41

我在從輸入文件名派生變量時遇到問題 - 特別是如果您想根據分隔符進行拆分。我嘗試了不同的方法（我無法開始工作），到目前為止唯一有效的方法最終失敗了，因為它尋找變量的所有可能變化（以及因此不存在的輸入文件）。我的問題 - 輸入文件按以下模式命名： 18AR1376_S57_R2_001.fastq.gz我一開始對變量的初步定義：SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")但這最終導致我的文件全部被命名18AR1376_S57，我想刪除 _S57 （指的是示例表 ID）。我在搜索時發現的一種可行的方法是：SAMPLES,SHEETID, = glob_wildcards("../run_links/{sample}_{SHEETID}_R1.001.fastq.gz"}但它會查找樣本和sheetid的所有可能組合，因此會查找不存在的輸入文件。然后我嘗試了一種更基本的 python 方法：SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz")ID,=expand([{sample}.split("_")[0]], sample=SAMPLES)``但這根本不起作用然后我嘗試保留原來的通配符 glob SAMPLES, = glob_wildcards("../run_links/{sample}_R1_001.fastq.gz") 但定義新的輸出文件名稱（基于我在另一個論壇中找到的說明） - 但這給了我一個我無法弄清楚的語法錯誤。rule trim: input: R1 = "../run_links/{sample}_R1_001.fastq.gz", R2 = "../run_links/{sample}_R2_001.fastq.gz" params: prefix=lambda wildcards, output: output: R1_Pair = ["./output/trimmed/{}_R1_trim_PE.fastq".format({sample}.split("_")[0])], #or the below version R1_Sing = ["./output/trimmed/{}_R1_trim_SE.fastq".format(a) for a in {sample}.split("_")], R2_Pair = ["./output/trimmed/{}_R2_trim_PE.fastq".format(a) for a in {sample}.split("_")], R2_Sing = ["./output/trimmed/{}_R2_trim_SE.fastq".format(a) for a in {sample}.split("_")] resources: cpus=8 log: "../logs/trim_{sample}.log" conda: "envs/trim.yaml" shell: """ trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 ILLUMINACLIP:scripts/Truseq-Adapter.fa:2:30:10 LEADING:20 TRAILING:20 SLIDINGWINDOW:4:20 MINLEN:50 """所以，有兩個選擇，工作流程開始：將樣本拆分為樣本和樣本表 ID定義新的輸出名稱并在_{sample} 的分隔符上使用 split有人對如何解決這個問題有建議嗎？謝謝

查看完整描述

1 回答

開滿天機

TA貢獻1786條經驗獲得超13個贊

glob_wildcards我將使用一個簡單的 python 字典來定義附加到 fastq 文件的示例名稱，而不是使用：

import os

import re

d = dict()

fastqPath = "."

for fastqF in [f for f in os.listdir(fastqPath) if(re.match(r"^[\w-]+_R1_001\.fastq\.gz", f))]:

s = re.search(r"(^[\w-]+)_(S\d+)_R1_(001.fastq.gz)", fastqF)

samplename = s.group(1)

fastqFfile = os.path.join(fastqPath, fastqF)

fastqRfile = os.path.join(fastqPath, s.group(1) + "_" + s.group(2) + "_R2_" + s.group(3))

if(os.path.isfile(fastqRfile)):

d[samplename] = {"read1":os.path.abspath(fastqFfile),"read2":os.path.abspath(fastqRfile)}

fastq 輸入文件非常易于使用：

rule all:

input:

expand("output/trimmed/{sample}_R1_trim_PE.fastq",sample=d)

rule trim:

input:

R1 = lambda wildcards: d[wildcards.sample]["read1"],

R2 = lambda wildcards: d[wildcards.sample]["read2"]

output:

R1_Pair="output/trimmed/{sample}_R1_trim_PE.fastq",

R1_Sing="output/trimmed/{sample}_R1_trim_SE.fastq",

R2_Pair="output/trimmed/{sample}_R2_trim_PE.fastq",

R2_Sing="output/trimmed/{sample}_R2_trim_SE.fastq"

resources:

cpus=8

log:

"../logs/trim_{sample}.log"

conda:

"envs/trim.yaml"

shell:

"""

trimmomatic PE -trimlog {log} -threads {resources.cpus} {input.R1} {input.R2} {output.R1_Pair} {output.R1_Sing} {output.R2_Pair} {output.R2_Sing} HEADCROP:10 I>

"""

注意：我刪除了params您的規則中未使用的部分trim。

反對回復 2023-10-31

1 回答
0 關注
165 瀏覽

關注

添加回答

舉報

0/150

提交

取消

亚洲在线久爱草,狠狠天天香蕉网,天天搞日日干久草,伊人亚洲日本欧美

熱搜

最近搜索清空

Snakemake 從輸入文件名派生多個變量

Snakemake 從輸入文件名派生多個變量

1 回答

添加回答