Last modified headers #547
Codecov Report
Additional details and impacted files
@@            Coverage Diff            @@
##             main    #547    +/-   ##
=========================================
  Coverage     93.87%  93.88%
- Complexity       48      49     +1
=========================================
  Files            44      45     +1
  Lines           980    1030    +50
  Branches         52      55     +3
=========================================
+ Hits            920     967    +47
- Misses           36      38     +2
- Partials         24      25     +1
Oh, that blew up at scale real quick 😅
It's what I feared: these dates aren't going to conform to RFC 1123, and they're going to be pretty wild.
Here's a more fuzzy approach to convert these dates (RFC 1123 or similar) into `yyyyMMddHHmmss`:
// Map lowercase month abbreviations to zero-padded month numbers ("jan" -> "01").
val months = Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec").zipWithIndex.map { case (s, d) => (s, ("0" + (d + 1)).takeRight(2)) }
"""
Fri, 23 Oct 2009 20:0830 GMT
Sat, 24 Oct 2009 14:3224 GMT
Tue Aug 22 17:18:37 2000 GMT
Thu, 22 Oct 2009 23:1407 GMT
Sun, 25 Oct 2009 00:3030 GMT
""".split("\n").map(_.trim).filter(_.nonEmpty).flatMap { str =>
  val lc = str.toLowerCase
  // Find the month by name, then identify the remaining fields by token length.
  months.find(m => lc.contains(m._1)).map(_._2).flatMap { m =>
    // Drop the colons and the leading weekday token, then key each token by its length.
    val d = str.replace(":", "").split(' ').drop(1).map(d => (d.length, d)).toMap
    // 4 characters = year, 2 characters = day, 6 characters = HHmmss.
    for (y <- d.get(4); n <- d.get(2); t <- d.get(6)) yield y + m + n + t
  }
}
Output:
Array(
  "20091023200830",
  "20091024143224",
  "20000822171837",
  "20091022231407",
  "20091025003030"
)
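For comparison, here is a minimal sketch of how a strict RFC 1123 parse could be tried first, with the token-length heuristic above as a fallback. This is not the code in this PR: the names `LastModifiedSketch`, `strict`, `fuzzy`, and `convert` are hypothetical, and normalizing the strict result to UTC is an assumption made for the sketch.

```scala
import java.time.{ZoneOffset, ZonedDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, for illustration only.
object LastModifiedSketch {
  // Target layout discussed in this PR.
  private val target = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Lowercase month abbreviations paired with zero-padded month numbers.
  private val months = Seq("jan", "feb", "mar", "apr", "may", "jun",
    "jul", "aug", "sep", "oct", "nov", "dec")
    .zipWithIndex
    .map { case (name, idx) => (name, f"${idx + 1}%02d") }

  // Strict path: succeeds only on well-formed RFC 1123 dates.
  // Assumption: the result is normalized to UTC before formatting.
  def strict(raw: String): Option[String] =
    Try(
      ZonedDateTime
        .parse(raw.trim, DateTimeFormatter.RFC_1123_DATE_TIME)
        .withZoneSameInstant(ZoneOffset.UTC)
        .format(target)
    ).toOption

  // Fuzzy path: the same token-length heuristic as the snippet above.
  def fuzzy(raw: String): Option[String] = {
    val lc = raw.toLowerCase
    months.find { case (name, _) => lc.contains(name) }.flatMap { case (_, mm) =>
      // Drop colons and the leading weekday token, then key tokens by length.
      val byLength = raw.trim.replace(":", "").split(' ').drop(1).map(t => t.length -> t).toMap
      for (y <- byLength.get(4); d <- byLength.get(2); t <- byLength.get(6)) yield y + mm + d + t
    }
  }

  // Try the strict parse first and fall back to the heuristic.
  def convert(raw: String): Option[String] = strict(raw).orElse(fuzzy(raw))
}
```

With this arrangement, well-formed headers go through `java.time`, and only the malformed ones rely on the heuristic. The heuristic still returns `None` for inputs such as a single-digit day, since it depends on a 2-character day token.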
At scale test with GeoCities was successful.
import io.archivesunleashed._
val data = "/tuna1/scratch/nruest/geocites/warcs"
val test = RecordLoader.loadArchives(data, sc)
  .all()
  .select($"crawl_date", $"last_modified_date", $"mime_type_web_server", $"mime_type_tika")
test.write
  .option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ")
  .format("csv")
  .option("escape", "\"")
  .option("encoding", "utf-8")
  .save("/tuna1/scratch/nruest/aut-547-test")
If you're good with this @ianmilligan1, let me know and I'll squash this all down and merge. Then I'll work on documentation updates and a release next week, as well as getting it pulled into ARCH.
This is great @ruebot, and the GeoCities test is illuminating. It would be great, down the line, to do a little study to check the reliability of this approach.
@ianmilligan1 yeah, as I was looking at all the dates coming in from GeoCities, my brain was being swarmed with a whole bunch of new research questions to ask the dataset! I guess I should update this dataset again too 😃
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
GitHub issue(s): #546
What does this Pull Request do?
Implements extracting `last_modified_date` of a resource where available.
- `getLastModified` for `SparklingArchiveRecord`
- `CovertLastModifiedDate` to convert RFC 1123 dates to `yyyyMMddHHmmss`
- `last_modified_date` column for
  - `.all()`
  - `.webpages()`
  - `.images()`
  - `.pdfs()`
  - `.audio()`
  - `.videos()`
  - `.spreadsheets()`
  - `.presentationProgramFiles()`
  - `.wordProcessorFiles()`
  - `.css()`
  - `.html()`
  - `.js()`
  - `.json()`
  - `.plainText()`
  - `.xml()`
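Since the converted value is a plain `yyyyMMddHHmmss` string, downstream analysis may want to turn it back into a date object. The following is a minimal sketch of that step, not part of this PR: `TimestampSketch`, `parse`, and `ageInDays` are hypothetical names, and it assumes both the last-modified value and the crawl date are available as `yyyyMMddHHmmss` strings.

```scala
import java.time.{Duration, LocalDateTime}
import java.time.format.DateTimeFormatter
import scala.util.Try

// Hypothetical helper, for illustration only.
object TimestampSketch {
  private val fmt = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

  // Parse a yyyyMMddHHmmss string; None if the value is malformed.
  def parse(ts: String): Option[LocalDateTime] =
    Try(LocalDateTime.parse(ts, fmt)).toOption

  // Whole days between the last-modified time and the crawl time.
  // Assumption: both inputs use the yyyyMMddHHmmss layout.
  def ageInDays(lastModified: String, crawl: String): Option[Long] =
    for (m <- parse(lastModified); c <- parse(crawl))
      yield Duration.between(m, c).toDays
}
```

Returning `Option` keeps rows with missing or unparseable dates from failing a larger job; such rows simply yield `None`.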
Example:
How should this be tested?
Additional Notes:
This is going to require A LOT of documentation updates.
Interested parties
@digitalshawn 👋