Skip to content

Commit 8898d23

Browse files
committed
changes
1 parent 712df6d commit 8898d23

14 files changed

+297
-209
lines changed

.gitignore

+6-3
Original file line numberDiff line numberDiff line change
@@ -93,10 +93,13 @@ ENV/
9393

9494
.idea/
9595
.DS_Store
96-
*/inputf.txt
97-
*/outputf.txt
96+
inputf.txt
97+
outputf.txt
9898
bin/
9999
kmers.txt
100-
pdb/**.ent.gz
100+
pdb/*.ent.gz
101+
pdb/*/*.ent.gz
102+
uniprotkb/**.fasta.gz
103+
101104

102105
.vscode/*

README.md

+24-16
Original file line numberDiff line numberDiff line change
@@ -6,22 +6,23 @@ Description and purpose of the project will come at a later point.
66
### Prerequisites
77
`g++`: through gcc installation
88

9-
`sqlite3`: Debian/Ubuntu: `apt install sqlite3`; macOS: `brew install sqlite3`
10-
119
`gzip / gunzip`: usually pre-installed
1210

1311
`python3`: installed and part of path
1412

15-
### Preperation
13+
optional: `sqlite3`: Debian/Ubuntu: `apt install sqlite3`; macOS: `brew install sqlite3`
14+
15+
### Preparation
16+
1617
Clone the repo
1718
```
1819
git clone https://github.com/cemiu/kmers.git && cd kmers
1920
```
2021
Two folders need to be populated:
21-
- pdb
22-
- uniprotkb
22+
- `pdb` (required)
23+
- `uniprotkb` (optional)
2324
### pdb
24-
The pdb folder has to contain all experimental PDB file in the .ent.gz format.
25+
The `pdb` folder has to contain experimental PDB files in the .ent.gz format.
2526

2627
Instructions for downloading can be found here:
2728

@@ -31,18 +32,24 @@ https://files.wwpdb.org/pub/pdb/data/structures/divided/pdb/
3132

3233
Alternatively a mirror can be found here: https://pycom.brunel.ac.uk/misc/pdb_2023-07-28.tar (42 GB)
3334

34-
Once downloaded they have to be placed in the `pdb` folder **without** being uncompressed. It does not matter whether they are in `pdb/file.ent.gz` or `pdb/<folder>/file.ent.gz`.
35+
Once downloaded they have to be placed in the `pdb` folder **without** being decompressed.
36+
It does not matter whether they are divided; e.g. `pdb/file.ent.gz` or `pdb/<folder>/file.ent.gz`.
3537

3638
### uniprotkb
37-
The project requires `uniprot_sprot.fasta.gz` (400 MB after processing)
39+
**Optionally**, the `uniprotkb` folder can be populated with the uniprotkb fasta files.
40+
This is only needed, if the PDBs should be associated to a Protein. If this is not required, **skip this step**.
3841

39-
Optionally, `uniprot_trembl.fasta.gz` can be used, to match more PDBs (250 GB after processing).
42+
Files:
43+
- `uniprot_sprot.fasta.gz` (400 MB after processing)
44+
- Use only Swiss-Prot, has the majority of PDB coverage
45+
- Optionally, `uniprot_trembl.fasta.gz` (250 GB after processing)
46+
- Use TrEMBL to match more PDBs; might be useful for max. coverage
4047

4148
The latter might result in (slightly) more PDBs which can be associated to a Protein. The difference is expected to be trivial.
4249

43-
Place the files in the `uniprotkb` folder without uncompressing them.
50+
Place the files in the `uniprotkb` folder without decompressing them.
4451

45-
By default, only Swiss-Prot is used. To also use TrEMBL, uncomment line 11 in `run.sh`.
52+
Once `run.sh` has been executed and the database `uniprotkb/uniprot_sequences.db` has been created, the files can be deleted.
4653

4754
### Running
4855

@@ -52,8 +59,9 @@ To run the script, execute:
5259
```
5360

5461
This will
55-
- Compile C++ binaries
56-
- Process Swiss-Prot / TrEMBL into a database
57-
- (`uniprot_sprot.fasta.gz` / `uniprot_trembl.fasta.gz`) can be deleted afterwards
58-
- Extract 3d k-mer from the PDBs
59-
- TODO: process the k-mers
62+
- Compile C++ binaries, if they don't exist
63+
- Ask whether to process the uniprotkb files
64+
- If yes, create the database `uniprotkb/uniprot_sequences.db`, if it doesn't exist
65+
- Ask for k-mer size (default: k=12)
66+
- Extracts 3d k-mer from the PDBs (into `pdb_output` folder)
67+
- Extracts k-mer of length k into `kmer.txt`, along with frequency

cpp_scripts/extract_pdb_coordinates/AtomDataParser.h

+17-3
Original file line numberDiff line numberDiff line change
@@ -2,15 +2,29 @@
22
#define ATOMDATAPARSER_H
33

44
#include <string>
5+
#include <sstream>
56

67
struct AtomData
78
{
8-
bool isValidAtom; // whether the atom is valid (CA)
9+
bool isValidAtom; // whether the atom is valid (CA)
910
std::string resName; // residue name (AA)
1011
int resSeq; // residue sequence number
1112
float x, y, z;
12-
};
1313

14-
void parseAtomData(const std::string& str, AtomData& data, size_t offset = 0);
14+
AtomData(const std::string& str, size_t offset = 0)
15+
{
16+
std::string atom_name = str.substr(-offset + 12, 4);
17+
std::stringstream atom_ss(atom_name);
18+
atom_ss >> atom_name;
19+
20+
isValidAtom = atom_name == "CA"; // whether the atom is ca
21+
22+
resName = str.substr(-offset + 17, 3);
23+
resSeq = std::stoi(str.substr(-offset + 22, 4));
24+
x = std::stof(str.substr(-offset + 30, 8));
25+
y = std::stof(str.substr(-offset + 38, 8));
26+
z = std::stof(str.substr(-offset + 46, 8));
27+
}
28+
};
1529

1630
#endif // ATOMDATAPARSER_H

cpp_scripts/extract_pdb_coordinates/Constants.cpp

+7-3
Original file line numberDiff line numberDiff line change
@@ -1,5 +1,7 @@
11
#include "Constants.h"
22

3+
#include <unordered_set>
4+
35
#define X(code, name) name,
46
const char *code_name[] = {
57
PDB_PARSING_CODES
@@ -16,7 +18,9 @@ const std::unordered_map<std::string, char> aminoAcidLookup = {
1618
{"TRP", 'W'}, {"VAL", 'V'}, {"SEC", 'U'}, {"PYL", 'O'},
1719
{"XPL", 'O'}, // for pdb 1L2Q
1820
{"GLX", 'Z'}, // for pdb 1KP0
19-
{"ASX", 'B'} // for pdb 1KP0
20-
// 3e2o, 2fmd, 2atc, 4cpa
21-
21+
{"ASX", 'B'}, // for pdb 1KP0
22+
{"UNK", '.'} // unknown AA
2223
};
24+
25+
// rare amino acids = SELENOCYSTEINE, PYRROLYSINE, ..., & unknown AA
26+
const std::unordered_set<char> invalidAminoAcids{'U', 'O', 'Z', 'B', '.'};

cpp_scripts/extract_pdb_coordinates/Constants.h

+3-3
Original file line numberDiff line numberDiff line change
@@ -2,6 +2,7 @@
22
#define CONSTANTS_H
33

44
#include <unordered_map>
5+
#include <unordered_set>
56
#include <string>
67

78
#define PDB_PARSING_CODES \
@@ -11,13 +12,11 @@ X(RESOLUTION_NOT_SPECIFIED, "RESOLUTION_NOT_SPECIFIED") \
1112
X(MISSING_NON_TERMINAL_RESIDUES, "MISSING_NON_TERMINAL_RESIDUES") \
1213
X(NO_ALPHA_CARBON_ATOMS_FOUND, "NO_ALPHA_CARBON_ATOMS_FOUND") \
1314
X(IS_NOT_PROTEIN, "IS_NOT_PROTEIN") \
14-
X(EXCLUDE_RARE_AMINO_ACIDS, "EXCLUDE_RARE_AMINO_ACIDS") \
15+
X(EXCLUDE_UNKNOWN_OR_RARE_AMINO_ACIDS, "EXCLUDE_UNKNOWN_OR_RARE_AMINO_ACIDS") \
1516
X(HAS_UNKNOWN_RESIDUE, "HAS_UNKNOWN_RESIDUE") \
1617
X(INVALID_SEQUENCE, "INVALID_SEQUENCE") \
1718
X(NO_UNIPROT_ID, "NO_UNIPROT_ID") \
1819

19-
// rare amino acids = SELENOCYSTEINE, PYRROLYSINE, others
20-
2120
#define X(code, name) code,
2221
enum PDBParsingCode : size_t {
2322
PDB_PARSING_CODES
@@ -37,6 +36,7 @@ enum PDBType {PROTEIN, DNA, RNA, MISC};
3736

3837
extern const float MAX_RESOLUTION;
3938
extern const std::unordered_map<std::string, char> aminoAcidLookup;
39+
extern const std::unordered_set<char> invalidAminoAcids;
4040

4141
#endif // CONSTANTS_H
4242

Original file line numberDiff line numberDiff line change
@@ -0,0 +1,50 @@
1+
#ifndef PDBCONTEXT_H
2+
#define PDBCONTEXT_H
3+
4+
#include <string>
5+
#include <vector>
6+
#include <sstream>
7+
#include <unordered_map>
8+
#include <unordered_set>
9+
10+
struct PDBContext {
11+
// main data
12+
std::string pdbId;
13+
float resolution = -1.0f;
14+
std::unordered_set<std::string> uniprotIds;
15+
16+
// input and output data
17+
std::vector<std::string> output;
18+
19+
// sequence data
20+
std::stringstream parsedSequence;
21+
std::unordered_map<char, std::stringstream> sequenceStreams;
22+
23+
// residue tracking
24+
int prevCAResiduePosition = -1;
25+
int firstCAResidue = 0;
26+
27+
// condition flags
28+
bool hasResiduesOutOfOrder = true;
29+
bool anyCAAtomsPresent = false;
30+
bool isNotProtein = false;
31+
bool hasExcludedAminoAcid = false;
32+
33+
// error tracking
34+
std::vector<std::string> errorOutput;
35+
36+
void resetPDBOutput() {
37+
if (!anyCAAtomsPresent && output.size()) {
38+
anyCAAtomsPresent = true;
39+
}
40+
41+
output.clear();
42+
hasResiduesOutOfOrder = true;
43+
hasExcludedAminoAcid = false;
44+
prevCAResiduePosition = -1;
45+
firstCAResidue = 0;
46+
parsedSequence.str("");
47+
}
48+
};
49+
50+
#endif // PDBCONTEXT_H

cpp_scripts/extract_pdb_coordinates/Utils.cpp

+12-14
Original file line numberDiff line numberDiff line change
@@ -5,6 +5,7 @@
55
#include <iostream>
66

77
#include "Constants.h"
8+
#include "PDBContext.h"
89

910
std::string concatenateString(const std::vector<std::string>& strings) {
1011
const char delim = ',';
@@ -56,11 +57,11 @@ float extractResolution(const std::string &line) {
5657
/// PROCESS ENTRY ///
5758
/////////////////////
5859

59-
PDBType processHeader(const std::string &line) {
60+
PDBType processHeader(const std::string &line, PDBContext &con) {
6061
std::string cls = line.substr(10, 40); // 11-50
6162
std::string pdbId = line.substr(62, 4); // 63-66
6263

63-
std::cout << "pdb_id: " << pdbId << std::endl;
64+
con.pdbId = pdbId;
6465

6566
if (cls.find("DNA") != std::string::npos) {
6667
if (cls.find("DNA BINDING PROTEIN") == std::string::npos)
@@ -75,31 +76,31 @@ PDBType processHeader(const std::string &line) {
7576
}
7677

7778
// Remark row
78-
void processRemark(const std::string &line, float &resolution) {
79+
void processRemark(const std::string &line, PDBContext &con) {
7980
int remark_no = std::stoi(line.substr(7, 3));
8081
switch (remark_no) {
8182
case 2:
8283
int extractedRes = extractResolution(line);
8384
if (extractedRes != -1) {
84-
resolution = extractResolution(line);
85+
con.resolution = extractResolution(line);
8586
}
8687
break;
8788
}
8889
}
8990

9091
// DBRef row
91-
void processDBRef(const std::string &line, std::unordered_set<std::string> &uniprotIds) {
92+
void processDBRef(const std::string &line, PDBContext &con) {
9293
std::string db = line.substr(26, 6); // 27 - 32
9394
if (db != "UNP ") // only match uniprot
9495
return;
9596

9697
std::string uniprotId = line.substr(33, 8); // 34 - 41
9798
std::stringstream parser(uniprotId);
9899
parser >> uniprotId;
99-
uniprotIds.insert(uniprotId);
100+
con.uniprotIds.insert(uniprotId);
100101
}
101102

102-
void processDBRef1(const std::string &line, std::unordered_set<std::string> &uniprotIds) {
103+
void processDBRef1(const std::string &line, PDBContext &con) {
103104
// process 1 for uniprot
104105
std::string db = line.substr(26, 6); // 27 - 32
105106
if (db != "UNP ") // only match uniprot
@@ -112,24 +113,21 @@ void processDBRef1(const std::string &line, std::unordered_set<std::string> &uni
112113
std::string uniprotId = nextLine.substr(18, 22); // 19 - 40
113114
std::stringstream parser(uniprotId);
114115
parser >> uniprotId;
115-
uniprotIds.insert(uniprotId);
116+
con.uniprotIds.insert(uniprotId);
116117
}
117118

118119
// SEQRES row
119-
void processSequence(const std::string &line, std::unordered_map<char, std::stringstream> &sequenceStreams) {
120+
void processSequence(const std::string &line, PDBContext &con) {
120121
char chainId = line[11];
121122
std::string aa;
122123
std::string aaLine = line.substr(19, 51);
123124
std::stringstream aaStream(aaLine);
124125
while (aaStream >> aa) {
125126
try {
126-
sequenceStreams[chainId] << aminoAcidLookup.at(aa);
127+
con.sequenceStreams[chainId] << aminoAcidLookup.at(aa);
127128
} catch (std::out_of_range) {
128129
// replace non-standard AA with dot (.)
129-
sequenceStreams[chainId] << '.';
130-
131-
// sequenceStreams.erase(chainId);
132-
// throw std::out_of_range("test");
130+
con.sequenceStreams[chainId] << '.';
133131
return;
134132
}
135133
}

cpp_scripts/extract_pdb_coordinates/Utils.h

+6-6
Original file line numberDiff line numberDiff line change
@@ -6,15 +6,15 @@
66
#include <unordered_set>
77

88
#include "Constants.h"
9+
#include "PDBContext.h"
910

1011
std::string concatenateString(const std::vector<std::string>& strings);
1112
std::string concatenateString(const std::unordered_set<std::string>& strings);
1213

13-
PDBType processHeader(const std::string &line);
14-
float extractResolution(const std::string &line);
15-
void processRemark(const std::string &line, float &resolution);
16-
void processDBRef(const std::string &line, std::unordered_set<std::string> &uniprotIds);
17-
void processDBRef1(const std::string &line, std::unordered_set<std::string> &uniprotIds);
18-
void processSequence(const std::string &line, std::unordered_map<char, std::stringstream> &sequenceStreams);
14+
PDBType processHeader(const std::string &line, PDBContext &con);
15+
void processRemark(const std::string &line, PDBContext &con);
16+
void processDBRef(const std::string &line, PDBContext &con);
17+
void processDBRef1(const std::string &line, PDBContext &con);
18+
void processSequence(const std::string &line, PDBContext &con);
1919

2020
#endif // UTILS_H
Binary file not shown.

0 commit comments

Comments
 (0)