
Parsing HTML and then pretty printing it #18

Open
rlaferla opened this issue May 12, 2016 · 7 comments

Comments


rlaferla commented May 12, 2016

I'm trying to parse some html text and then pretty print the entire document. I couldn't tell what was the best way to traverse the hierarchy of nodes/elements and wasn't sure how to get the inner html content of a tag. I'm posting here because I think this could improve the documentation for the API.

    let html = "**** put some html text here. ****"

    let doc = try HTMLDocument(string: html, encoding: NSUTF8StringEncoding)

    if let root = doc.root {
        let str = self.dumpElement(root)
        print(str)
    }

    func dumpElement(element: XMLElement) -> String {
        var str = ""

        str = "<\(element.tag!.uppercaseString)"
        for attr in element.attributes {
            str += " \(attr.0)=\(attr.1)"
        }
        str += ">"
        let nodes = element.childNodes(ofTypes: [.Text])
        for node in nodes {
            str += node.stringValue
        }

        for el in element.children {
            str += self.dumpElement(el)
        }
        str += "</\(element.tag!.uppercaseString)>"
        return str
    }

Is this correct?

Owner

cezheng commented May 13, 2016

What do you mean exactly by pretty print?

The code you shared actually does the following things:

  • Makes each element's tag name uppercase
  • Ignores all child nodes that are neither text nodes nor elements (CDATA sections, comments, etc.)
  • Dumps all text nodes first and the elements after them, regardless of their original order

So I don't really think it makes sense.
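To illustrate the ordering point, here is a minimal sketch of a dump that visits children in document order and indents by depth. The `Node` enum and `prettyPrint` function are hypothetical stand-ins so the example runs on its own; they are not Fuzi's types, and mapping them onto the library's node API is left as an assumption.

```swift
import Foundation

// Hypothetical stand-in for a parsed tree; these are NOT Fuzi's types.
enum Node {
    case element(tag: String, attributes: [(name: String, value: String)], children: [Node])
    case text(String)
    case comment(String)
}

// Recursively renders a node, one item per line, indented two spaces per level.
// Children are visited in their original order, whatever their kind.
func prettyPrint(_ node: Node, depth: Int = 0) -> String {
    let indent = String(repeating: "  ", count: depth)
    switch node {
    case .text(let value):
        // Whitespace-only text is dropped; other text is trimmed for display.
        let trimmed = value.trimmingCharacters(in: .whitespacesAndNewlines)
        return trimmed.isEmpty ? "" : indent + trimmed + "\n"
    case .comment(let value):
        return indent + "<!--" + value + "-->\n"
    case .element(let tag, let attributes, let children):
        let attrs = attributes.map { " \($0.name)=\"\($0.value)\"" }.joined()
        var out = indent + "<\(tag)\(attrs)>\n"
        for child in children {
            out += prettyPrint(child, depth: depth + 1)
        }
        out += indent + "</\(tag)>\n"
        return out
    }
}

// Usage: text, element, comment, and text again stay interleaved as written.
let tree = Node.element(tag: "p", attributes: [(name: "class", value: "intro")], children: [
    .text("Hello "),
    .element(tag: "b", attributes: [], children: [.text("world")]),
    .comment(" note "),
    .text("!")
])
let output = prettyPrint(tree)
print(output)
```

Trimming text like this is only suitable for debugging output, since it changes inline whitespace that can be significant in HTML.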

Owner

cezheng commented May 13, 2016

I guess you only want to pretty-print some XML while debugging? Try the xmllint command.


hvtor commented Sep 5, 2016

How might you create the html String? I'm using a separate class that inherits from NSObject to parse down a URL.

func httpGet(request: NSURLRequest, callback: @escaping (String, String?) -> Void) {
    let session = URLSession.shared
    // The completion handler runs asynchronously, after httpGet has returned.
    let task = session.dataTask(with: request as URLRequest) { (data, response, error) -> Void in
        if let error = error {
            callback("error", error.localizedDescription)
        } else if let data = data, let result = String(data: data, encoding: .ascii) {
            callback(result, nil)
        } else {
            // Avoids the crash that force-unwrapping missing or undecodable data would cause.
            callback("error", "response could not be decoded as ASCII")
        }
    }
    task.resume()
}

In my ViewController I'm trying to do this:

let html = data
do {
    let doc = try HTMLDocument(string: html, encoding: String.Encoding.utf8)
} catch {
}

but I get an error for the 'html' variable, 'use of unresolved identifier.'

How do I set the html as a string from my initial URLrequest set up in my data service?

Owner

cezheng commented Sep 5, 2016

@hvtor you don't need to create the string if you have NSData. The README states that you can create a document from a String, an NSData (Data in Swift 3), or a [CChar] instance. Having an NSData instance is actually simpler, since you don't have to specify the encoding.

let doc = try HTMLDocument(data: data)
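One way to apply this is to have the callback hand back the raw `Data` and use it inside the callback, which also sidesteps the unresolved-identifier error, since the response only exists there. The sketch below is an assumption-laden illustration: `Loader`, `fetchHTML`, and `fakeLoader` are hypothetical names, and the loader closure stands in for a `URLSession` data task so the example runs offline.

```swift
import Foundation

// Hypothetical loader type: stands in for a URLSession data task so the sketch runs offline.
typealias Loader = (URL, @escaping (Data?, Error?) -> Void) -> Void

// Passes the raw bytes to the caller instead of decoding them into a String first.
func fetchHTML(from url: URL, using load: Loader,
               completion: @escaping (Data?, String?) -> Void) {
    load(url) { data, error in
        if let error = error {
            completion(nil, error.localizedDescription)
        } else {
            completion(data, nil)
        }
    }
}

// Usage with a fake loader that returns a fixed page (placeholder markup and URL).
let fakeLoader: Loader = { _, done in
    done(Data("<html><body><a href=\"/b\">Stranger Things</a></body></html>".utf8), nil)
}

var received: Data?
var failure: String?
fetchHTML(from: URL(string: "https://example.com/list")!, using: fakeLoader) { data, message in
    // The response is only in scope here; with Fuzi, this is where
    // `try HTMLDocument(data: data)` from the comment above would go.
    received = data
    failure = message
}
```

In a real app the loader would wrap `URLSession.shared.dataTask(with:completionHandler:)`, and any UI update made from the callback would need to hop back to the main queue.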


hvtor commented Sep 5, 2016

@cezheng Yes. Thank you. :) BTW, great documentation. Just a bit 😴 I guess.

Owner

cezheng commented Sep 5, 2016

@hvtor haha, it's true. Any suggestions on improving it?


hvtor commented Sep 5, 2016

No I meant I am 😴. It's great documentation. An example webpage would be great too! Showing how the elements can be mapped over. I'm not too familiar with

I'm trying to parse an IMDb list and it's a series of anchor tags. It's not clear to me how you select for specific tags.

Walking Dead Stranger Things

goes to read the docs again

Parent. And then the child tags.
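For the "parent, then the child tags" idea, a rough sketch of collecting every anchor under a parent element follows. The `Element` struct and `descendants(of:tagged:)` are hypothetical stand-ins so the example runs on its own, and the `href` values are placeholders; with Fuzi itself, the README's XPath and CSS query sections are likely the more direct route.

```swift
// Hypothetical stand-in for a parsed element; NOT Fuzi's XMLElement.
struct Element {
    var tag: String
    var attributes: [String: String] = [:]
    var text: String = ""
    var children: [Element] = []
}

// Collects every descendant with the given tag name, in document order.
func descendants(of root: Element, tagged tag: String) -> [Element] {
    var matches: [Element] = []
    for child in root.children {
        if child.tag == tag {
            matches.append(child)
        }
        matches += descendants(of: child, tagged: tag)
    }
    return matches
}

// Usage: a parent containing anchors at different depths (placeholder hrefs).
let list = Element(tag: "div", children: [
    Element(tag: "a", attributes: ["href": "/title/a"], text: "Walking Dead"),
    Element(tag: "span", children: [
        Element(tag: "a", attributes: ["href": "/title/b"], text: "Stranger Things")
    ])
])
let anchors = descendants(of: list, tagged: "a")
let titles = anchors.map { $0.text }
let links = anchors.compactMap { $0.attributes["href"] }
```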


3 participants