Alternatively a mirror can be found here: https://pycom.brunel.ac.uk/misc/pdb_2023-07-28.tar (42 GB)
33
34
34
-
Once downloaded they have to be placed in the `pdb` folder **without** being uncompressed. It does not matter whether they are in `pdb/file.ent.gz` or `pdb/<folder>/file.ent.gz`.
35
+
Once downloaded they have to be placed in the `pdb` folder **without** being decompressed.
36
+
It does not matter whether they are divided into subfolders; both `pdb/file.ent.gz` and `pdb/<folder>/file.ent.gz` work.
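The layout rule above can be illustrated with a minimal Python sketch. It is not taken from the project: the helper names `find_pdb_archives` and `read_first_line` are hypothetical, and the sketch only assumes that the files end in `.ent.gz`, as in the paths quoted above. It shows that a recursive search finds flat and nested files alike, and that a gzipped file can be read without decompressing it on disk.

```python
import gzip
from pathlib import Path


def find_pdb_archives(pdb_dir):
    """Return every *.ent.gz file under pdb_dir, at any folder depth."""
    # rglob descends into subfolders, so pdb/file.ent.gz and
    # pdb/<folder>/file.ent.gz are both found.
    return sorted(Path(pdb_dir).rglob("*.ent.gz"))


def read_first_line(archive):
    """Read the first line of a gzipped file without unpacking it to disk."""
    with gzip.open(archive, "rt", encoding="ascii", errors="replace") as handle:
        return handle.readline().rstrip("\n")
```

Calling `find_pdb_archives("pdb")` from the repository root would list the files a recursive scan can see; whether `run.sh` scans the folder in this way is an assumption.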
35
37
36
38
### uniprotkb
37
-
The project requires `uniprot_sprot.fasta.gz` (400 MB after processing)
39
+
**Optionally**, the `uniprotkb` folder can be populated with the uniprotkb fasta files.
40
+
This is only needed if the PDBs should be associated with a protein. If this is not required, **skip this step**.
38
41
39
-
Optionally, `uniprot_trembl.fasta.gz` can be used, to match more PDBs (250 GB after processing).
42
+
Files:
43
+
- `uniprot_sprot.fasta.gz` (400 MB after processing)
44
+
- Using only Swiss-Prot already covers the majority of PDBs
45
+
- Optionally, `uniprot_trembl.fasta.gz` (250 GB after processing)
46
+
- Use TrEMBL to match more PDBs; this may be useful for maximum coverage
40
47
41
48
The latter might allow (slightly) more PDBs to be associated with a protein. The difference is expected to be trivial.
42
49
43
-
Place the files in the `uniprotkb` folder without uncompressing them.
50
+
Place the files in the `uniprotkb` folder without decompressing them.
44
51
45
-
By default, only Swiss-Prot is used. To also use TrEMBL, uncomment line 11 in `run.sh`.
52
+
Once `run.sh` has been executed and the database `uniprotkb/uniprot_sequences.db` has been created, the files can be deleted.
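To illustrate why the FASTA files are no longer needed once the database exists, here is a minimal Python sketch of turning gzipped FASTA files into a database file. It is not the project's implementation: it assumes the `.db` file is SQLite, and the table name `sequences`, its two columns, and the function names are hypothetical.

```python
import gzip
import sqlite3


def iter_fasta(path):
    """Yield (header, sequence) pairs from a gzipped FASTA file."""
    header, chunks = None, []
    with gzip.open(path, "rt") as handle:
        for line in handle:
            line = line.strip()
            if line.startswith(">"):
                # A new record starts: emit the previous one first.
                if header is not None:
                    yield header, "".join(chunks)
                header, chunks = line[1:], []
            elif line:
                chunks.append(line)
    if header is not None:
        yield header, "".join(chunks)


def build_database(fasta_paths, db_path):
    """Copy all records into a (hypothetical) table and return the row count."""
    conn = sqlite3.connect(db_path)
    try:
        with conn:  # commits on success
            conn.execute(
                "CREATE TABLE IF NOT EXISTS sequences (header TEXT, sequence TEXT)"
            )
            for path in fasta_paths:
                conn.executemany(
                    "INSERT INTO sequences VALUES (?, ?)", iter_fasta(path)
                )
        return conn.execute("SELECT COUNT(*) FROM sequences").fetchone()[0]
    finally:
        conn.close()
```

In this sketch every sequence is copied into the database, so later steps would read only the database file and never the original archives.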
46
53
47
54
### Running
48
55
@@ -52,8 +59,9 @@ To run the script, execute:
52
59
```
53
60
54
61
This will
55
-
- Compile C++ binaries
56
-
- Process Swiss-Prot / TrEMBL into a database
57
-
- (`uniprot_sprot.fasta.gz` / `uniprot_trembl.fasta.gz`) can be deleted afterwards
58
-
- Extract 3d k-mer from the PDBs
59
-
- TODO: process the k-mers
62
+
- Compile C++ binaries, if they don't exist
63
+
- Ask whether to process the uniprotkb files
64
+
- If yes, create the database `uniprotkb/uniprot_sequences.db`, if it doesn't exist
65
+
- Ask for k-mer size (default: k=12)
66
+
- Extract 3D k-mers from the PDBs (into the `pdb_output` folder)
67
+
- Extract k-mers of length k into `kmer.txt`, along with their frequencies
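The final counting step can be pictured with a short Python sketch. It is a generic illustration only: it counts overlapping substrings of plain strings, whereas the project extracts its k-mers from 3D structures with C++ binaries. The function names and the tab-separated layout of the output file are assumptions; only the default `k=12` comes from the list above.

```python
from collections import Counter


def count_kmers(sequences, k=12):
    """Count every overlapping substring of length k across the sequences."""
    counts = Counter()
    for seq in sequences:
        # Sequences shorter than k produce an empty range and are skipped.
        for i in range(len(seq) - k + 1):
            counts[seq[i:i + k]] += 1
    return counts


def write_kmers(counts, path):
    """Write one 'kmer<TAB>frequency' line per k-mer, most frequent first."""
    with open(path, "w") as handle:
        for kmer, freq in counts.most_common():
            handle.write(f"{kmer}\t{freq}\n")
```

For example, `count_kmers(["ABCABC"], k=3)` counts `ABC` twice and `BCA` and `CAB` once each.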