Last modified headers #547

Merged
merged 5 commits into from
Nov 7, 2022

ruebot
Member

@ruebot ruebot commented Nov 2, 2022

GitHub issue(s): #546

What does this Pull Request do?

Extracts the last_modified_date of a resource from its Last-Modified header, where one is available.

Example:

import io.archivesunleashed._
val data = "/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz"
RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified_date").show(20, false)

// Exiting paste mode, now interpreting.

[2022-11-02T16:05:35.325Z - 00000 - HdfsIO] Opening file file:/home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz (Offset: 0, length: 0, decompress: false, strategy: BlockWise [dynamic])
+--------------+------------------+                                             
|crawl_date    |last_modified_date|
+--------------+------------------+
|20091027143300|                  |
|20091027143259|20000923233454    |
|20091027143259|20020913163029    |
|20091027143300|20020211154553    |
|20091027143259|19980919164703    |
|20091027143259|20080125150303    |
|20091027143300|20010921224658    |
|20091027143258|20081009015203    |
|20091027143300|                  |
|20091027143259|20020416145103    |
|20091027143300|20090223022835    |
|20091027143300|20030928090558    |
|20091027143300|20091027143300    |
|20091027143300|20021203212451    |
|20091027143300|                  |
|20091027143300|20040530033010    |
|20091027143300|                  |
|20091027143259|20090223022352    |
|20091027143300|                  |
|20091027143300|20010608202736    |
+--------------+------------------+
only showing top 20 rows

import io.archivesunleashed._
data: String = /home/nruest/Projects/au/sample-data/geocities/GEOCITIES-20091027143300-00114-ia400112.us.archive.org.warc.gz

How should this be tested?

  • The included tests should cover it.
  • I'm also going to test this at scale with the GeoCities dataset.

Additional Notes:

This is going to require A LOT of documentation updates.

Interested parties

@digitalshawn 👋

@codecov

codecov bot commented Nov 2, 2022

Codecov Report

Merging #547 (c3af611) into main (8a4bf54) will increase coverage by 0.00%.
The diff coverage is 95.23%.

Additional details and impacted files
@@            Coverage Diff            @@
##               main     #547   +/-   ##
=========================================
  Coverage     93.87%   93.88%           
- Complexity       48       49    +1     
=========================================
  Files            44       45    +1     
  Lines           980     1030   +50     
  Branches         52       55    +3     
=========================================
+ Hits            920      967   +47     
- Misses           36       38    +2     
- Partials         24       25    +1     

@ruebot ruebot marked this pull request as ready for review November 2, 2022 16:12
@ruebot ruebot requested a review from ianmilligan1 November 2, 2022 16:12
@ruebot
Member Author

ruebot commented Nov 2, 2022

Oh, that blew up at scale real quick 😅

java.time.format.DateTimeParseException: Text 'Fri, 23 Oct 2009 20:0830 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 830

java.time.format.DateTimeParseException: Text 'Sat, 24 Oct 2009 14:3224 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 3224

java.time.format.DateTimeParseException: Text 'Tue Aug 22 17:18:37 2000 GMT' could not be parsed at index 0

java.time.format.DateTimeParseException: Text 'Thu, 22 Oct 2009 23:1407 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 1407

java.time.format.DateTimeParseException: Text 'Sun, 25 Oct 2009 00:3030 GMT' could not be parsed: Invalid value for MinuteOfHour (valid values 0 - 59): 3030

It's what I feared: these dates aren't going to conform to RFC 1123, and they're going to be pretty wild.
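
For context, here is a minimal sketch of a strict parse that returns None instead of throwing when a header does not match RFC 1123. The helper name parseLastModified and the output pattern are assumptions for illustration, not necessarily what this PR does:

```scala
import java.time.ZonedDateTime
import java.time.format.DateTimeFormatter
import scala.util.Try

// Assumed output layout, matching the yyyyMMddHHmmss values shown in the table above.
val outFormat = DateTimeFormatter.ofPattern("yyyyMMddHHmmss")

// Hypothetical helper: strict RFC 1123 parse; any parse failure becomes None.
def parseLastModified(header: String): Option[String] =
  Try(ZonedDateTime.parse(header.trim, DateTimeFormatter.RFC_1123_DATE_TIME))
    .toOption
    .map(_.format(outFormat))
```

With this, "Fri, 23 Oct 2009 20:08:30 GMT" yields Some("20091023200830"), while the malformed values above come back as None rather than failing the job.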

@helgeho
Contributor

helgeho commented Nov 3, 2022

Here's a fuzzier approach to converting these dates (RFC 1123 or similar) into yyyyMMddHHmmss:

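// Month abbreviations paired with zero-padded month numbers ("jan" -> "01", ...).
val months = Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec").zipWithIndex.map{case (s,d) => (s, ("0" + (d + 1)).takeRight(2))}
"""
Fri, 23 Oct 2009 20:0830 GMT
Sat, 24 Oct 2009 14:3224 GMT
Tue Aug 22 17:18:37 2000 GMT
Thu, 22 Oct 2009 23:1407 GMT
Sun, 25 Oct 2009 00:3030 GMT
""".split("\n").map(_.trim).filter(_.nonEmpty).flatMap { str =>
    val lc = str.toLowerCase
    // Find the month by substring match; the other fields are picked by token length.
    months.find(m => lc.contains(m._1)).map(_._2).flatMap { m =>
        // Strip colons, drop the leading weekday token, and key the remaining tokens by length.
        val d = str.replace(":", "").split(' ').drop(1).map(d => (d.length, d)).toMap
        // 4 characters = year, 2 = day of month, 6 = HHmmss.
        for (y <- d.get(4); n <- d.get(2); t <- d.get(6)) yield y + m + n + t
    }
}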

output:

Array(
  "20091023200830",
  "20091024143224",
  "20000822171837",
  "20091022231407",
  "20091025003030"
)
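
One way to package the same idea as a reusable function, sketched under the assumption that a None result is acceptable for anything that doesn't fit. The names monthNumbers and fuzzyLastModified are hypothetical; the logic mirrors the snippet above:

```scala
// Month abbreviation -> zero-padded month number ("jan" -> "01", ..., "dec" -> "12").
val monthNumbers: Seq[(String, String)] =
  Seq("jan", "feb", "mar", "apr", "may", "jun", "jul", "aug", "sep", "oct", "nov", "dec")
    .zipWithIndex
    .map { case (name, i) => (name, f"${i + 1}%02d") }

// Hypothetical wrapper: returns yyyyMMddHHmmss, or None when a field cannot be found.
def fuzzyLastModified(header: String): Option[String] = {
  val lc = header.toLowerCase
  monthNumbers.find { case (name, _) => lc.contains(name) }.flatMap { case (_, month) =>
    // Strip colons, skip the leading weekday token, then index the tokens by length:
    // 4 characters -> year, 2 -> day of month, 6 -> HHmmss.
    val byLength = header.replace(":", "").split(' ').drop(1).map(t => (t.length, t)).toMap
    for (y <- byLength.get(4); d <- byLength.get(2); t <- byLength.get(6))
      yield y + month + d + t
  }
}
```

Because fields are matched purely by length, a day written without a leading zero (a single character) or a hyphenated date token produces None rather than a wrong value.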

@ruebot
Member Author

ruebot commented Nov 4, 2022

At scale test with GeoCities was successful.

import io.archivesunleashed._
val data = "/tuna1/scratch/nruest/geocites/warcs"
val test = RecordLoader.loadArchives(data, sc).all().select($"crawl_date", $"last_modified_date", $"mime_type_web_server", $"mime_type_tika")
test.write.option("timestampFormat", "yyyy/MM/dd HH:mm:ss ZZ").format("csv").option("escape", "\"").option("encoding", "utf-8").save("/tuna1/scratch/nruest/aut-547-test")
$ wc -l aut-547-test.csv 
317151386 aut-547-test.csv
$ head -n25 aut-547-test.csv 
crawl_date,last_modified_date,mime_type_web_server,mime_type_tika
20091025200809,20090223021853,text/html,text/html
20091025200809,20010725152023,text/html,text/html
20091025200809,19980216015402,image/gif,image/gif
20091025200809,20090223022127,text/html,text/html
20091025200809,20021211215621,image/gif,image/gif
20091025200809,20050419153723,text/html,text/html
20091025200809,20090223023317,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,20060508201301,text/html,text/html
20091025200809,20090223023317,text/html,text/html
20091025200809,19980308054339,image/gif,image/gif
20091025200809,20090223023317,text/html,text/html
20091025200809,20001212220521,image/jpeg,image/jpeg
20091025200809,20071102161255,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,20090223021623,text/html,text/html
20091025200809,20090223021623,text/html,text/html
20091025200809,20001211234323,text/html,text/html
20091025200809,20090218194412,text/html,text/html
20091025200809,19990102025342,image/jpeg,image/jpeg
20091025200809,20090223022127,text/html,text/html
20091025200809,20090223022127,text/html,text/html
20091025200809,"",text/html,text/plain
20091025200809,"",text/html,text/plain
20091025200809,20000917151720,image/gif,image/gif

If you're good with this @ianmilligan1, let me know and I'll squash this all down and merge. After that, I'll work on documentation updates and a release next week, as well as getting it pulled into ARCH.

Member

@ianmilligan1 ianmilligan1 left a comment

This is great, @ruebot, and the GeoCities test is illuminating. Down the line, it would be great to do a small study to check the reliability of this approach.
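
A possible starting point for such a study, sketched in plain Scala: bucket each row by how last_modified_date relates to crawl_date. The function name classify and the bucket labels are hypothetical; the sample rows are taken from the first example output in this PR.

```scala
// Classify a (crawl_date, last_modified_date) pair of yyyyMMddHHmmss strings.
// Fixed-width digit strings sort chronologically, so string comparison is sufficient.
def classify(crawlDate: String, lastModified: String): String =
  if (lastModified.isEmpty) "missing"
  else if (lastModified > crawlDate) "after_crawl"
  else if (lastModified == crawlDate) "same_as_crawl"
  else "before_crawl"

// Three rows copied from the example table in the PR description.
val sample = Seq(
  ("20091027143300", ""),
  ("20091027143259", "20000923233454"),
  ("20091027143300", "20091027143300")
)

// Count how many rows fall into each bucket.
val counts: Map[String, Int] =
  sample.groupBy { case (c, m) => classify(c, m) }.map { case (k, v) => k -> v.size }
```

Rows landing in "after_crawl" or "same_as_crawl" would be the first ones worth inspecting, since a Last-Modified value at or after capture time may reflect the server clock rather than the resource.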

@ruebot
Member Author

ruebot commented Nov 7, 2022

@ianmilligan1 yeah, as I watched all the dates come in from GeoCities, my brain was swarmed with a whole bunch of new research questions to ask the dataset! I guess I should update this dataset again too 😃

@ruebot ruebot merged commit cdf8e76 into main Nov 7, 2022
@ruebot ruebot deleted the last-modified-headers branch November 7, 2022 17:43
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 9, 2022
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
ruebot added a commit to archivesunleashed/aut-docs that referenced this pull request Nov 16, 2022
- Update to include `last_modified_date` where applicable
- Rewrite text-analysis (extraction) documentation
Development

Successfully merging this pull request may close these issues.

Include last modified date for a resource
3 participants