LDSC munge for UKB sumstats #26

Open
yk-tanigawa opened this issue Jul 5, 2020 · 11 comments

@yk-tanigawa
Contributor

We convert the UKB sumstats into LDSC munge format.

This will enable us to perform

@yk-tanigawa
Contributor Author

Focusing on the finalized summary statistic files, we started LDSC munge.

There are 20,940 such files across 7 populations, and we have submitted the computation for all of them.

As of now,

  • 15,213 files are converted to LDSC munge
  • 5,727 files: still running.

Please see the analysis scripts for more info.

@yk-tanigawa
Contributor Author

It turned out that there was an issue in the filtering conditions, and we are in fact computing LDSC munge for all sum stats in the gwas/current directory.

We now have 19,163+ munged sumstats (3,669 for WB).

Once GWAS is finalized, we can identify the updated sum stats (~1,100 in total; ~880 will be overwritten and ~230 will be added) and re-apply LDSC munge.

@yk-tanigawa
Contributor Author

yk-tanigawa commented Jul 6, 2020

We considered applying LDSC munge to the meta-analyzed summary statistics (to get a phenotype mapping for #25), but we decided to use the WB sum stats for the mapping between FinnGen and UKB.

@yk-tanigawa
Contributor Author

Files are in /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc

@yk-tanigawa
Contributor Author

With progress on #21, we should refresh this and update the #27 analysis

@yk-tanigawa
Contributor Author

In 1_remove-incomplete-20200713.sh, we fixed the previous error in the filtering condition.

In the original version of 1_generate_input_list.sh, we incorrectly specified `NR>1 || $NF == 1080969`, but it should have been `NR>1 && $NF == 1080969`. This resulted in 909 extra munged files.
Those files were NOT used in the heritability analysis. In this script, we remove those 909 files.
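
To illustrate why the `||` version let extra files through, here is a minimal sketch on a toy table. The column layout (a file name followed by a line count in the last field) and the file names are assumptions made for illustration; they are not the actual input of 1_generate_input_list.sh.

```shell
# Toy input (hypothetical): a header plus three rows; the last field is a line count.
tmp=$(mktemp)
printf 'file\tn_lines\n' > "$tmp"
printf 'a.gz\t1080969\nb.gz\t512\nc.gz\t1080969\n' >> "$tmp"

# With ||, every non-header row passes, whatever its last field is.
n_or=$(awk 'NR>1 || $NF == 1080969 {n++} END {print n+0}' "$tmp")

# With &&, only non-header rows whose last field equals 1080969 pass.
n_and=$(awk 'NR>1 && $NF == 1080969 {n++} END {print n+0}' "$tmp")

echo "|| keeps ${n_or} rows; && keeps ${n_and} rows"
rm -f "$tmp"
```

In this toy case the `||` filter keeps the row with 512 lines as well, which is the kind of extra entry the fix removes.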

@yk-tanigawa
Contributor Author

yk-tanigawa commented Jul 18, 2020

With the finalized GWAS results (#21), we apply LDSC munge again.

1_LDSC_munge.20200717-210250.job.lst has 2,714 files, i.e. 905 batches of up to 3 files each (905 * 3 = 2,715).

bash 1_generate_input_list.sh | tee 1_LDSC_munge.$(date +%Y%m%d-%H%M%S).job.lst | tee /dev/stderr | wc -l

ml load resbatch
ml R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-905 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-210250.job.lst 3

Submitted batch job 4255901
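
The array size used above can be derived from the list length. The sketch below assumes that the last argument passed to $parallel_sbatch_sh is the number of list entries handled per array task; that reading is inferred from the numbers in this comment, and the helper name `n_tasks` is hypothetical.

```shell
# Number of array tasks needed when each task handles up to per_task list entries
# (ceiling division using integer arithmetic).
n_tasks() {
  n_files=$1
  per_task=$2
  echo $(( (n_files + per_task - 1) / per_task ))
}

# 2714 list entries, 3 per task
n_tasks 2714 3
```

Using the ceiling rather than plain integer division ensures the final, partially filled batch still gets its own array index.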

@yk-tanigawa
Contributor Author

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc -type f -name "*.gz" | wc -l
20295
ml load resbatch R/3.6 gcc

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-1000 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.lst 1

# Submitted batch job 4260541

sbatch -p mrivas,normal,owners --time=3:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge --output=logs/munge.%A_%a.out --error=logs/munge.%A_%a.err --array=1-877 $parallel_sbatch_sh 1_LDSC_munge.sh 1_LDSC_munge.20200717-231130.job.part2.lst 1

# Submitted batch job 4260621
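
The two submissions above cover array indices 1-1000 and 1-877 with one file per task. A minimal sketch of how a second list could be produced is shown below. It assumes the split was made because the array index is capped at 1000 and that the full list had 1,877 entries; neither is stated in this thread, and the file names are hypothetical.

```shell
work=$(mktemp -d)
# Hypothetical job list with 1877 entries, one input file per line.
awk 'BEGIN { for (i = 1; i <= 1877; i++) print "sumstats_" i ".tsv.gz" }' > "$work/job.lst"

max_array=1000   # assumed cap on the array index

# Lines beyond the cap go into a second list, submitted as a separate array.
tail -n +"$((max_array + 1))" "$work/job.lst" > "$work/job.part2.lst"

n_first=$(head -n "$max_array" "$work/job.lst" | awk 'END {print NR}')
n_part2=$(awk 'END {print NR}' "$work/job.part2.lst")
echo "first array: 1-${n_first}; second array: 1-${n_part2}"
rm -rf "$work"
```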

@yk-tanigawa
Contributor Author

We also apply LDSC munge on the meta-analyzed sumstats.


ml load R/3.6 gcc resbatch

sbatch -p mrivas,normal,owners --time=1:00:00 --mem=8000 --nodes=1 --cores=1 --job-name=munge_meta --output=logs/munge_meta.%A_%a.out --error=logs/munge_meta.%A_%a.err --array=1-949 $parallel_sbatch_sh 1b_LDSC_munge.sh 1_LDSC_munge.20200718-134522.metal.job.lst 4
Submitted batch job 4279977

@yk-tanigawa
Contributor Author

There are some failed files: the output directory has 3,417 munged files, but the job list has 3,794 entries.

find /oak/stanford/groups/mrivas/ukbb24983/array-combined/ldsc/metal/ -type f -name "*.gz" | wc
   3417    3417  399695

[ytanigaw@sh02-09n54 ~/repos/rivas-lab/ukbb-tools/07_LDSC/jobs/202007_LDSC]$ wc 1_LDSC_munge.20200718-134522.metal.job.lst
  3794   3794 340971 1_LDSC_munge.20200718-134522.metal.job.lst
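
One way to list which inputs have no munged output is to reduce both lists to phenotype IDs and compare them. This is a sketch under assumed file naming: the `PHE_*` IDs, the `.metal.tsv.gz` and `.sumstats.gz` suffixes, and the rule that the ID is the basename up to the first dot are all hypothetical.

```shell
work=$(mktemp -d)
# Hypothetical inputs: the submitted job list and the munged outputs found on disk.
printf '%s\n' /in/PHE_A.metal.tsv.gz /in/PHE_B.metal.tsv.gz /in/PHE_C.metal.tsv.gz > "$work/job.lst"
printf '%s\n' /out/PHE_A.sumstats.gz /out/PHE_C.sumstats.gz > "$work/done.lst"

# Strip the directory and everything from the first dot to get an ID per line.
sed -e 's|.*/||' -e 's|\..*||' "$work/job.lst"  | sort -u > "$work/expected.ids"
sed -e 's|.*/||' -e 's|\..*||' "$work/done.lst" | sort -u > "$work/done.ids"

# comm -23 prints lines that appear only in the first (expected) list.
missing=$(comm -23 "$work/expected.ids" "$work/done.ids")
echo "$missing"
rm -rf "$work"
```

In practice the second list would come from the `find` command above rather than from a hand-written file.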

@guhanrv
Collaborator

guhanrv commented Nov 19, 2020

An update on this: because I needed to run the pairwise rg calculations across all traits, I had to convert all of the summary statistics to the munged format. I've tabulated the phenotypes for which the sumstats munge failed with an error similar to the following:

Traceback (most recent call last):
  File "/opt/ldsc/munge_sumstats.py", line 701, in munge_sumstats
    check_median(dat.SIGNED_SUMSTAT, signed_sumstat_null, 0.1, sign_cname))
  File "/opt/ldsc/munge_sumstats.py", line 373, in check_median
    raise ValueError(msg.format(F=name, M=expected_median, V=round(m, 2)))
ValueError: WARNING: median value of SIGNED_SUMSTATS is 0.11 (should be close to 0.0). This column may be mislabeled.

These are at https://github.com/rivas-lab/ukbb-tools/blob/master/07_LDSC/helpers/affected_metal_traits.txt.

A quick check on the gwas.qc.tsv file for the array-combined dataset indicates that these are summary statistics for traits with low N overall.
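
For context, the error above is raised when the median of the signed summary statistic is too far from its expected null value. The sketch below imitates that kind of check on toy data; the expected value 0 and tolerance 0.1 are read off the error message and the `check_median(..., 0.1, ...)` call in the traceback, the input values are made up, and this is not LDSC's own code.

```shell
# Median of hypothetical signed summary statistics: sort numerically, then take
# the middle value (or the mean of the two middle values for an even count).
median=$(printf '%s\n' 0.30 -0.05 0.11 0.02 0.25 | sort -n | awk '
  { v[NR] = $1 }
  END {
    m = (NR % 2) ? v[(NR + 1) / 2] : (v[NR / 2] + v[NR / 2 + 1]) / 2
    print m
  }')

# Flag the trait when the median is more than tol away from the expected value.
status=$(awk -v m="$median" -v expected=0 -v tol=0.1 'BEGIN {
  d = m - expected
  if (d < 0) d = -d
  s = (d > tol) ? "flagged" : "ok"
  print s
}')
echo "median=${median} status=${status}"
```

A check like this could be run on a sumstats column before munging to spot affected traits early.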
