%Writing Zotero Web Translators
%Sebastian Karcher
%PLoS Webinar, November 3rd, 2014

#Types of Zotero Translators

  • Web Translator (imports data from the web via the URL bar icon)
  • Import Translator (imports data from a file or the clipboard)
  • Search Translator (imports from a database based on an identifier: DOI, ISBN, PMID)
  • Export Translator (exports Zotero data)

#Types of Web Translators

  • Screen Scrapers
    • Based on Framework
    • From Scratch
  • Using Import format
    • Get it from site header
    • Get it via GET or POST
    • MARC is a special case
  • Using Search (but we rarely use this)

#How a Web Translator Works

  • "Do I know this page?" -- Target Regex
  • "Can I import from this page?" -- detectWeb
  • Doing the actual work -- doWeb
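
The three stages above can be sketched as follows. This is a simplified illustration, not a working translator: the site `journal.example.org`, the `runTranslator` driver, and the plain object returned by `doWeb` are assumptions made for this sketch. In a real translator the target regex lives in the translator's metadata, Zotero itself calls `detectWeb` and `doWeb`, and `doWeb` saves a Zotero item instead of returning a value.

```javascript
// Stage 1: the target regex (hypothetical site).
const target = /^https?:\/\/journal\.example\.org\//;

// Stage 2: detectWeb reports what the page offers, or false for nothing.
function detectWeb(doc, url) {
  if (url.includes("/search/")) return "multiple";
  if (url.includes("/article/")) return "journalArticle";
  return false;
}

// Stage 3: doWeb does the actual work. This sketch only returns a plain
// object; a real translator would build and save a Zotero item here.
function doWeb(doc, url) {
  return { itemType: detectWeb(doc, url), title: doc.title, url: url };
}

// Hypothetical driver showing the order in which the stages run.
function runTranslator(doc, url) {
  if (!target.test(url)) return null;    // "Do I know this page?"
  if (!detectWeb(doc, url)) return null; // "Can I import from this page?"
  return doWeb(doc, url);                // Doing the actual work
}

const fakeDoc = { title: "A Basic Webpage" }; // stand-in for a DOM document
console.log(runTranslator(fakeDoc, "https://journal.example.org/article/42"));
console.log(runTranslator(fakeDoc, "https://journal.example.org/about")); // null
```

Each stage is cheaper than the next, so a page that fails an early check never reaches the later ones.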

#XPaths -- Pointing to content on a webpage

  • XPaths are basically "directions" used to point to a part of a webpage
  • A webpage is built up from a number of nested nodes
  • This is what the simplest webpage looks like:

      <html>
        <head>
          <title>A Basic Webpage</title>
        </head>
        <body>
          <div id="title">Title</div>
          <div id="content" class="text">Content</div>
        </body>
      </html>
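
To make the "nested nodes" idea concrete, the sketch below rebuilds the sample page as a plain JavaScript object and prints the path from the root to every node. The object layout (`tag`, `attrs`, `text`, `children`) and the `listPaths` helper are assumptions made for this illustration; a browser uses real DOM nodes instead.

```javascript
// The sample page as nested nodes: each node has a tag, attributes, and children.
const page = {
  tag: "html", attrs: {}, children: [
    { tag: "head", attrs: {}, children: [
      { tag: "title", attrs: {}, text: "A Basic Webpage", children: [] },
    ] },
    { tag: "body", attrs: {}, children: [
      { tag: "div", attrs: { id: "title" }, text: "Title", children: [] },
      { tag: "div", attrs: { id: "content", class: "text" }, text: "Content", children: [] },
    ] },
  ],
};

// Collect the path from the root to every node, depth first.
function listPaths(node, prefix = "") {
  const path = prefix + "/" + node.tag;
  return [path, ...node.children.flatMap((child) => listPaths(child, path))];
}

console.log(listPaths(page).join("\n"));
// /html
// /html/head
// /html/head/title
// /html/body
// /html/body/div
// /html/body/div
```

Note that two different nodes share the path `/html/body/div`, so a path made of tag names alone does not always identify a single node.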

#The most basic XPath

  • Give directions: at every corner/node, tell Zotero where to go.

  • Let's say we want to go to "The Content of the webpage"

  • Take the "html" road, turn at "body", then take the "div" street; or, written as an XPath:

      /html/body/div
    

#Making XPaths more precise

  • But we're still "lost": which of the two "div" streets do we go down?

  • Option 1: Take the second <div>

      /html/body/div[2]
    
  • Option 2: Take the <div> that has "content" as an id

      /html/body/div[@id="content"]
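
The effect of the two options can be checked with a toy evaluator. This is a minimal sketch, not a real XPath engine: the `el` and `evaluate` helpers and the object-based node tree are assumptions made for this example, and only steps of the form `tag`, `tag[n]`, and `tag[@attr="value"]` are supported.

```javascript
// Hypothetical helper that builds a node of the sample page.
const el = (tag, attrs = {}, children = [], text = "") => ({ tag, attrs, children, text });
const page = el("html", {}, [
  el("head", {}, [el("title", {}, [], "A Basic Webpage")]),
  el("body", {}, [
    el("div", { id: "title" }, [], "Title"),
    el("div", { id: "content", class: "text" }, [], "Content"),
  ]),
]);

// Toy evaluator: follow the path one step at a time from the root.
function evaluate(root, xpath) {
  // A virtual parent lets the first step ("/html") match the root itself.
  let current = [{ children: [root] }];
  for (const step of xpath.split("/").filter(Boolean)) {
    const m = /^(\w+)(?:\[(\d+)\]|\[@(\w+)="([^"]*)"\])?$/.exec(step);
    if (!m) throw new Error("Unsupported step: " + step);
    const [, tag, position, attr, value] = m;
    current = current.flatMap((parent) => {
      let matches = parent.children.filter((child) => child.tag === tag);
      if (position) {
        const n = Number(position); // XPath positions count from 1
        matches = matches.slice(n - 1, n);
      }
      if (attr) matches = matches.filter((child) => child.attrs[attr] === value);
      return matches;
    });
  }
  return current;
}

const texts = (xpath) => evaluate(page, xpath).map((node) => node.text);
console.log(texts("/html/body/div"));                 // [ 'Title', 'Content' ]
console.log(texts("/html/body/div[2]"));              // [ 'Content' ]
console.log(texts('/html/body/div[@id="content"]'));  // [ 'Content' ]
```

Both options narrow the result to one node here, but the `id` version keeps working if the site later adds another `<div>` above the content, which is one reason to prefer attributes over positions.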
    

#Making XPaths more efficient

  • In an actual webpage, an XPath can be very long, so we'd like to make it shorter. We can use // to start anywhere in the HTML tree, e.g. "the <div> with "content" as its id, anywhere on the page":

      //div[@id="content"]
    
  • Sometimes we don't want to match the exact value of an attribute like id. In those cases we can use contains(), as in

      //div[contains(@id, "cont")]
    
  • We can combine conditions with "and" or "or" (in lowercase!)

      //div[@id="content" and @class="text"]
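
One way to read these expressions is to restate each as a plain JavaScript filter over the sample page. The sketch below does that on a hand-built node tree; the `el` and `descendants` helpers are assumptions made for this illustration, it is not an XPath engine, and the `or` expression in the last filter is an extra example in the same style.

```javascript
// Hypothetical helper that builds a node of the sample page.
const el = (tag, attrs = {}, children = [], text = "") => ({ tag, attrs, children, text });
const page = el("html", {}, [
  el("head", {}, [el("title", {}, [], "A Basic Webpage")]),
  el("body", {}, [
    el("div", { id: "title" }, [], "Title"),
    el("div", { id: "content", class: "text" }, [], "Content"),
  ]),
]);

// The node itself plus everything nested below it, in document order.
function descendants(node) {
  return [node, ...node.children.flatMap((child) => descendants(child))];
}

// "//div": every <div>, wherever it sits in the tree.
const divs = descendants(page).filter((n) => n.tag === "div");

// //div[@id="content"]
const byId = divs.filter((n) => n.attrs.id === "content");
// //div[contains(@id, "cont")]
const byPartialId = divs.filter((n) => (n.attrs.id || "").includes("cont"));
// //div[@id="content" and @class="text"]
const byIdAndClass = divs.filter((n) => n.attrs.id === "content" && n.attrs.class === "text");
// //div[@id="title" or @class="text"]
const byIdOrClass = divs.filter((n) => n.attrs.id === "title" || n.attrs.class === "text");

const texts = (nodes) => nodes.map((n) => n.text);
console.log(texts(byId));         // [ 'Content' ]
console.log(texts(byPartialId));  // [ 'Content' ]
console.log(texts(byIdAndClass)); // [ 'Content' ]
console.log(texts(byIdOrClass));  // [ 'Title', 'Content' ]
```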
    

#Zotero's built-in XPath helpers

  • ZU.xpathText(doc, xpath) returns the text of all nodes matching the XPath, joined into one string and separated by commas
  • ZU.xpath(doc, xpath) returns an array of all nodes matching the XPath
  • You can also use any other JavaScript function like doc.evaluate or doc.getElementsBy...
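
For comparison, here is a rough sketch of how similar helpers could be built on the standard `doc.evaluate` call. The names `xpathNodes` and `xpathTextApprox` are hypothetical, and this is not Zotero's actual implementation; returning `null` when nothing matches is a choice made for this sketch.

```javascript
// 7 is the numeric value of XPathResult.ORDERED_NODE_SNAPSHOT_TYPE.
const ORDERED_NODE_SNAPSHOT_TYPE = 7;

// Hypothetical stand-in for ZU.xpath: all matching nodes as an array.
function xpathNodes(doc, xpath) {
  const result = doc.evaluate(xpath, doc, null, ORDERED_NODE_SNAPSHOT_TYPE, null);
  const nodes = [];
  for (let i = 0; i < result.snapshotLength; i++) {
    nodes.push(result.snapshotItem(i));
  }
  return nodes;
}

// Hypothetical stand-in for ZU.xpathText: the text of all matches joined
// with a delimiter, or null when nothing matches.
function xpathTextApprox(doc, xpath, delimiter = ", ") {
  const nodes = xpathNodes(doc, xpath);
  if (nodes.length === 0) return null;
  return nodes.map((node) => node.textContent).join(delimiter);
}
```

In a browser console on the sample page, `xpathTextApprox(document, "//div")` would give `"Title, Content"`. Inside a translator, the ZU helpers are shorter and are usually the better choice.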

#Our Tools

  • Scaffold - a Firefox extension to write and test the translator
  • Firefox "Inspect Element" - to help us understand the structure of a webpage (there are alternatives like "Firebug")