<html>
<head>
<!-- Global site tag (gtag.js) - Google Analytics -->
<script async src="https://www.googletagmanager.com/gtag/js?id=UA-1043620-1"></script>
<script>
window.dataLayer = window.dataLayer || [];
function gtag(){dataLayer.push(arguments);}
gtag('js', new Date());
gtag('config', 'UA-1043620-1');
</script>
<title>Document Management, Digital Libraries and the Web</title>
<!-- Name: Larry Masinter -->
<!-- Address: Xerox PARC
3333 Coyote Hill Road
Palo Alto, CA 94304 -->
<!-- Phone: (415) 812-4365 (do not call) -->
<!-- Email: masinter@parc.xerox.com -->
<link rel="stylesheet" href="theme.css">
</head>
<body>
<h1>Document Management, Digital Libraries and the Web</h1>
<p>
June 9, 1995
<P>
<A HREF="#BIO"><B>Larry Masinter &lt;masinter@parc.xerox.com&gt;</B></A>
<HR>
<!-- *** global comments:
renumber references, z39.50 out of order
reference PARC work!!!
more about desktop integration; annotation, conversion operating
systems support, printing, asynchronous operations.
Printing:
DM does a respectable job of supporting printing, the web is
awful, especially getting closure on the true document that has
been shredded into sections. either not enough meat or not
enough leading. Some of the references are just too 'flip',
should call on more experience.
Another issue: distribution: transactions, reliability, response
time.
Another characterization: document management: content is
updated frequently. Digital library: content is relatively static.
DE: I am left in a number of places wanting you to tell me
more about what you think or what I should conclude. For example,
many of the issues you raise seem hard problems in their own
rights, and hardly become more tractible when they have to be
dealt with across two or three of the domains. Tell where you
see or want the world to go in your final section.
AP:
1. I was missing the resource discovery problem. It does seem different
from the search problem, and it comes up in all three areas: Doc Man is
weakest in it, but even there, you could imagine document archives to be
hard to find. Ones you know they exist, *then* you can search for docs
in them. In a company, you could well imagine the existence of a whole
archive not being known between departments or operating divisions. In
DLs and the Web, the resource discovery problem is obvious.
2. You *might* consider adding collaboration support issues. These are
clearly there for doc man. You hit it tangentially when introducing doc
man, but from a doc life cycle angle, not a human/human collaboration
angle. You get MUDs and all sorts of other examples. In DLs you could
bring in the collab angle as allowing client/ref-librarian common
search. On the web you could talk about collaborative publishing, link
maintenance, etc.
-->
<h2>Abstract</h2>
<em> Document management systems are used by individuals, office
workgroups and enterprises to organize and keep track of the documents
being produced as a part of their work. Digital Library technology is
being developed by many organizations to make the world's knowledge
available through computers and communication technology. The
World-Wide Web is an Internet application being used by individuals,
companies and other organizations for promoting themselves and their
products, conducting electronic commerce, and providing information to
the vast number of Internet users around the world. These three
application areas have much in common and also significant
differences. The paper notes the common elements and some of the
technical issues common in these areas, and explores the opportunities
for synergy when these applications merge. </em>
<hr>
<A NAME="CONTENTS"><H2>Contents</H2></A>
<ul>
<li> <A HREF="#intro">1. Introduction</A>
<ul>
<li> <A HREF="#dmover">1.1 Document Management Overview</A>
<li> <A HREF="#dlover">1.2 Digital Libraries Overview</A>
<li> <A HREF="#webover">1.3 The Web: an Overview</A>
</ul>
<li> <a href="#common">2. Common Elements</a>
<ul>
<li> <a href="#docids">2.1 Document Identifiers</a>
<li> <a href="#metadata">2.2 MetaData</a>
<li> <a href="#aaa">2.3 Authentication, Authorization and Accounting</a>
<li> <a href="#types">2.4 Document types</a>
<li> <a href="#search">2.5 Searching</a>
</ul>
<li> <a href="#opportunities">3. Opportunities</a>
<li> <A href="#refs">References</A>
<li> <A href="#acks">Acknowledgments</A>
</ul>
<HR>
<h2><a name="intro">1. Introduction</a></h2>
The terms "document management system", "digital library" and
"World-Wide Web" describe applications with a number of common
architectural elements, though they are distinct in many of their
features, in their domains of use, and in the systems and protocols
they involve. This <a href="#intro">first section of the paper</a>
describes each of the areas, their critical properties, some examples
of their use, and the systems, standards, and organizations involved
in developing them. <a href="#common">Section 2</a> then explores
many of the common design issues that are facing developers in each of
the areas. Finally, <a href="#opportunities">Section 3</a> sets out
some of the opportunities for integrating the three application areas.
<!-- Suggestion: talk about size of document management marketplace,
size of digital library efforts, and number of web sites -->
<h2><a name="dmover">1.1 Document Management Overview</a></h2>
Document management systems are software packages designed to help
individuals, workgroups and large enterprises manage their growing
number of documents stored in electronic form<a href="#ref1">[1]</a><a
href="#ref2">[2]</a>. Document management is seen as a way to help
companies manage the intellectual property that is locked up in the
company's documents, currently hidden away in a morass of directories
and subdirectories in scattered file servers across their networks.
Document management systems may be used for a workgroup (a group of
users connected via a local area network) or an enterprise (everyone
in a company, connected via a corporate network).
<p> Document management is used to manage the entire life cycle of a
document, from creation through multiple revisions and finally into
long-term storage and records management. For example, workgroup
document management systems often offer library services for
preserving update consistency, similar to check-out and check-in
capabilities of software source code control systems. When a user
checks out a document, the system locks the document from other users'
changes. When the document is checked back in, the document
management system makes it available for others to revise. Along with
maintaining update consistency, the document management application
tracks revisions in a multi-author/editor setting.
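<p> The check-out and check-in cycle described above can be sketched in
a few lines of Python. This is only an illustration under simplifying
assumptions: the <code>Repository</code> class, its method names and the
in-memory storage are hypothetical and do not correspond to any
particular product.

```python
class Repository:
    """Toy in-memory repository with exclusive check-out locks."""

    def __init__(self):
        self.revisions = {}   # doc_id -> list of (author, text), oldest first
        self.locks = {}       # doc_id -> user currently holding the lock

    def add(self, doc_id, author, text):
        self.revisions[doc_id] = [(author, text)]

    def check_out(self, doc_id, user):
        # Refuse the check-out if another user already holds the lock.
        holder = self.locks.get(doc_id)
        if holder is not None and holder != user:
            raise PermissionError(f"{doc_id} is checked out by {holder}")
        self.locks[doc_id] = user
        return self.revisions[doc_id][-1][1]

    def check_in(self, doc_id, user, new_text):
        # Only the lock holder may store a new revision; the lock is then released.
        if self.locks.get(doc_id) != user:
            raise PermissionError(f"{user} does not hold the lock on {doc_id}")
        self.revisions[doc_id].append((user, new_text))
        del self.locks[doc_id]
        return len(self.revisions[doc_id])  # new revision number, 1-based


repo = Repository()
repo.add("memo-1", "alice", "draft")
text = repo.check_out("memo-1", "bob")
try:
    repo.check_out("memo-1", "carol")   # bob still holds the lock
    blocked = False
except PermissionError:
    blocked = True
revision = repo.check_in("memo-1", "bob", text + " (edited)")
```

<p> Keeping every revision with its author, rather than overwriting the
text, is what lets such a system track changes in a multi-author setting.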
<p> Document management systems usually feature searching in
repositories of documents both by externally applied information about
the documents (e.g., user who entered it, date of revision, or version
relationship) and by content (e.g., search on words contained within
the document).
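<p> Search by content is commonly built on an inverted index that maps
each word to the documents containing it. The following sketch assumes
plain-text documents and a deliberately crude tokenizer; the function
names and sample documents are invented for illustration.

```python
import re
from collections import defaultdict


def build_index(documents):
    """Map each lowercase word to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for word in re.findall(r"[a-z0-9]+", text.lower()):
            index[word].add(doc_id)
    return index


def search(index, query):
    """Return ids of documents containing every word in the query."""
    words = re.findall(r"[a-z0-9]+", query.lower())
    if not words:
        return set()
    result = set(index.get(words[0], set()))
    for word in words[1:]:
        result = result.intersection(index.get(word, set()))
    return result


docs = {
    "memo-1": "Quarterly budget review for the Palo Alto office",
    "memo-2": "Budget request: new scanner for the imaging group",
}
index = build_index(docs)
```

<p> With this index, a query for "budget scanner" would select only the
second memo, because both words must appear in a matching document.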
<p> Frequently, document management systems are integrated with
imaging capabilities: the ability to deal with scanned raster images
(fax quality or higher) of documents that originated in paper form, as
well as with documents that originated in electronic form. While
imaging applications traditionally had been a separate domain, the
line between image management and general document management has been
increasingly blurred in recent years. In image document management
systems, optical character recognition (OCR) is used to analyze the
document content and index the corpus for content retrieval, even when
the documents themselves are retained in image form.
<!-- should have reference on OCR and text retrieval?
give references for image management systems, including XDOD
-->
<p> Document management systems are usually integrated with
desktop applications. That means that the user's application program
-- word processor, spreadsheet, graphic editor -- is modified to work
directly with the document management system. For example, if a user
running WordPerfect pulls down the "File/Open" menu, a search
interface to the document management repository might appear rather
than the standard file system dialog interface.
<!-- should have other examples of integration -->
<p> Document management systems are sometimes connected to or
integrated with workflow systems, though the latter is strictly
speaking a different application. While document management systems
deal with storing and searching documents in repositories, workflow
systems are organized around work processes. Thus, a workflow system
contains a model of the tasks of an organization and the roles that
individuals play in that organization, and routes the work according
to the model of the work process. Of course, the results of that
process are often stored in document management repositories, and
document management operations are often steps in the tasks managed by
the workflow system.
<h3>Applications of Document Management Systems</h3>
<p> To make clear the function of document management applications, it
may help to give some typical examples of how these systems are used:
<ul>
<li> A large multinational law firm manages all of its
correspondence and contracts in a document management system.
Because the firm believes it has an obligation to offer similar
legal advice to all clients in similar situations, the company
wants the system to keep track of all correspondence,
contracts, and so forth as produced in each of its offices.
<li> A large aerospace company finds that almost every plane off
their assembly line is different in configuration. The
documentation for the repair and maintenance of the plane needs
to match the configuration shipped. The document management
system allows the configuration of the shipped documentation to
match the product. As more and more manufacturers move into
custom product delivery and just-in-time manufacturing, it has
become increasingly important to have a system that can allow
documentation to track the changes in the products.
<li> Offices accumulate large repositories of general correspondence
and often look for smaller document management applications for
tracking correspondence and business documents.
</ul>
<p> There are a large number of vendors of document management systems.
Some of the major products and vendors include Documentum, PC Docs,
SoftSolutions from WordPerfect/Novell, FileNet, Visual Recall from
Xerox, and Mezzanine from Saros. Many other products include document
management capabilities, including offerings from Verity, Oracle, and
Lotus (Notes).
<p> As document management products have developed, there has been a
growing demand for standards to allow interoperability between them.
Large enterprises discover that different workgroups within their
organization have, for various reasons, chosen different document
management products. As they attempt to integrate these products
across the enterprise, enterprise-wide standard interfaces and
interoperability become increasingly important.
<p> To this end, consortia have organized to define standards for
document management. For example, the Open Document Management API
(ODMA) is a simple Application Program Interface (API) designed to let
desktop applications (such as an editor or spreadsheet) integrate with
any of a number of document management systems<a href="#ref3"
>[3]</a><a href="#ref4" >[4]</a><a href="#ref5" >[5]</a>. It
redefines file access menu items such as "Open", "Save", and "Save
as..." to call the document management system (if one is installed)
instead of the file system.
<p> At another level, there have been recent attempts by industry
groups to define a middleware layer between the user interface and
back-end document repositories, so that users in an enterprise can
access documents stored in multiple document management systems across
their enterprise. The two efforts by the Shamrock Document Management
Coalition (Shamrock's Enterprise Library Services) and the Document
Enabled Networking<a href="#ref6">[6]</a> specification are being
merged into a new Document Management Alliance (DMA)<a href="#ref7"
>[7]</a> to promote a single standard interface. These initiatives
are creating a set of standard interfaces that define system elements
such as "document", "repository", and "attribute" as well as
operations such as searching, checking out a document, and retrieving
it.
<!-- Somewhere here: uniformity vs. unification; one way to
standardize is to give a uniform interface that covers many kinds of
interfaces, another is to map between them. This is an issue in
attributes and search too -->
<h2><a name="dlover">1.2 Digital Libraries</a></h2>
What is a digital library? The term is sometimes used in a relatively
literal way to refer to a system or application whose function is
chiefly to extend the reach of a conventional library, for example by
making its collection available in electronic form to remote users.
More abstractly, the term is used to describe any application or
system aimed at providing access and services for a large electronic document
corpus. Usually the users of such corpora are thought of as members
of a general or specialized public, rather than the personnel of an
organization or enterprise. Over the last few years there have been
research and development projects of both types; see, for example, <a
href="#ref8">[8]</a><a href="#ref9">[9]</a><a href="#ref10">[10]</a><a
href="#ref11">[11]</a> and special issues of journals<a
href="#ref12">[12]</a>. For all their differences and particularities,
these projects have certain general characteristics in common.
<h3>Key Features of Digital Libraries</h3>
<p> Digital libraries usually possess large corpora of information of
generally high value. Not only is the material of high quality, but
also some care is placed on cataloging the material, and making sure
that the origin, date, and other external descriptive information is
accurate. Many digital library projects are concerned with providing
digital access to material that already exists within traditional
library collections, and thus concentrate on material that was
originally intended for analog media: libraries of scanned images of
photographs or printed texts, digitized video segments and so forth.
Other projects extend the library metaphor to other collections such
as scientific data sets, software libraries or multimedia works. A
great deal of work in this area concentrates on providing enhanced
content or access methods, with the problem often couched as one of
providing a way of satisfying the individual's particular "information
needs". This might be a chemistry graduate student looking for
information for a research project, a high-school student downloading
a multi-media chemistry text, or a market researcher looking for
information about chemical companies.
<h3>Digital library systems and standards</h3>
<p> While much digital library work is in its early phase of
development, there is a rich tradition in the library community that
has influenced the thinking and design of systems for Digital
Libraries. Historically, library automation has taken the form of
Online Public Access Catalogs (OPACs). The standards for online
library catalogs include MARC<a href="#ref13">[13]</a> and Z39.50<a
href="#ref27">[27]</a>. Another kind of metadata is represented by
the Scientific and Technical Attribute Set (STAS), which defines a
standard for metadata elements to describe scientific datasets as
opposed to traditional bibliographic material.
<p> More recently, a number of research initiatives have proposed
systems and mechanisms for future digital libraries, including the
six NSF/ARPA/NASA joint initiative projects, initiatives of the
national libraries and library system vendors. Previous work in
copyright management<a href="#ref14">[14]</a><a href="#ref15"
>[15]</a>, document identifiers<a href="#ref16">[16]</a>, and the
Computer Science Technical Report project <a href="#ref17">[17]</a>
also contribute to digital library technology.
<h2><a name="webover">1.3 The Web: an Overview</a></h2>
<p> These days, it is hardly necessary to define "the web" at an
Internet conference. (It's hardly necessary to define "the web" to the
cab driver who takes you to the conference from the airport.) For the
sake of contrast, though, it will be useful to lay out the web's key
features here.
<h3>Key Features of the web</h3>
<p> By "the web", I mean information on the Internet, as accessed
by individuals using a World-Wide Web client or some other network
information access tool. The web is accessed using one of the many
web browsers now available. The web provides a <em>document
interface</em> to information. That is, a user is presented with a
document which includes links to follow and forms to fill out. By
interacting with the document, the user causes a new document to be
presented. The web, as an Internet service, is primarily public. A
web site can provide access to a very large number of users across the
world.
<h3>Example applications of the web</h3>
<p> The web is used for institutional public relations and product
information, personal communication, online publishing, and
scientific, technical and scholarly interchange. For example,
companies put up web sites about their products and services; a
growing number of newspapers and information service providers are
producing web sites. Students put up `home pages' covering their
hobbies. Professional organizations and educational institutions give
out information about their organizations and their resources.
<h3>Web systems and standards</h3>
<p> There are a growing number of web systems and software packages,
including those produced by sponsored research, university researchers
and commercial vendors. Dozens of start-ups compete for attention.
<p> The web systems and protocols, originally defined in the research
community, are being refined by a number of companies and consortia
(the W3C consortium, for example) and being standardized by working
groups of the Internet Engineering Task Force (IETF). The IETF is
developing standards for Uniform Resource Locators (URLs), Uniform
Resource Names (URNs), the HyperText Transfer Protocol (HTTP), and the
HyperText Markup Language (HTML). These elements are the principal
elements of the World Wide Web. The web also includes other network
search protocols and access systems. For example, the Gopher protocol
defined by the University of Minnesota is part of the web, while the
Internet use of the Z39.50 standard is defined by the Z39.50
Implementors Group (ZIG)<a href="#ref18">[18]</a>.
<!-- ref article in Science? -->
<h2><a name="common">
2. Common Elements in Document Management, Digital Libraries and the
Web
</a></h2>
The three application areas of document management, digital libraries
and the web share common technology elements. This section describes
some of these common elements, how they're deployed in each area, and
the general design problems that are shared by all three areas. With
more coordination between the groups designing the systems and
protocols in these areas, solutions that are deployed for one set of
applications might be reapplied in others, duplicate effort avoided,
and the opportunities for synergy enhanced.
<h3><a name="docids">
2.1 Document Identifiers
</a></h3>
In any computer system for manipulating information, it is important
to allow objects to contain persistent references to other objects.
These references are used from inside databases, in bibliographies,
hypertext links, and in a variety of other ways. The approaches used
in document management, digital libraries and the web have differed.
<h4>Identifiers in Document Management systems</h4>
Commercial document management systems all employ some kind of
document identifier mechanism, so that pointers to documents in the
document management system can be saved and referenced independent of
that system. For example, ODMA has a document ID -- a persistent,
portable identifier for a document -- that is accepted or returned by
ODMA functions. It is used to save away references to documents, to
refer to documents in electronic mail or by other processes. Other
examples of document identifiers include those used in OpenDoc<a
href="#ref19" >[19]</a> and OLE. The OpenDoc standard uses the Bento
file format<a href="#ref20" >[20]</a>, which incorporates globally
unique identifiers to make references from one document to another.
OLE uses a variety of identifiers to keep permanent references valid
between composite objects<a href="#ref21">[21]</a>.
<h4>Identifiers in Libraries</h4>
Traditionally, the library community has developed a number of
mechanisms to uniquely identify a work. These mechanisms include "call
numbers" (e.g., the Library of Congress Call Number system which
yields identifiers that are printed like PS3566O815.W4.1987), ISBN
numbers (originally intended for inventory) and ISSN numbers (which
identify serials, i.e., material that is updated regularly). More
recently, librarians have tried to apply this apparatus to digital
works, which do not always lend themselves to traditional treatment
and which raise a number of design issues involving the use of
document identifiers<a href="#ref22">[22]</a>.
<h4>Identifiers on the Web</h4>
In the World-Wide Web, the most common kind of identifier is a URL.
URLs are probably familiar to anyone who has used a web browser or
read the papers in this conference, where the references include URLs.
While the name "URL" seems to indicate that it locates the object
(says `where it is'), in fact, a URL is more like an `access method':
it tells you how, on the Internet, to access the object.
As many have observed, there is a serious problem using URLs when
information or web resources move. There is a strong desire to create
a new scheme for URNs that name an object independent of its location.
Some kind of distributed URN -> URL location service (for which there
is not yet an accepted design) would then be employed to find out the
actual location of objects. Several proposals have been brought
forward and are being evaluated.
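<p> The idea of a name-to-location service can be illustrated with a
simple lookup table. This sketch is not any of the proposals under
evaluation: the <code>Resolver</code> class, the
<code>urn:example:</code> names and the example.com URLs are all
hypothetical, and a real service would be distributed rather than held
in one table.

```python
class Resolver:
    """Hypothetical name-to-location table; not any proposed URN design."""

    def __init__(self):
        self.locations = {}  # name -> list of URLs where the object can be fetched

    def register(self, name, url):
        self.locations.setdefault(name, []).append(url)

    def move(self, name, old_url, new_url):
        # The name stays the same; only the recorded location changes.
        urls = self.locations[name]
        urls[urls.index(old_url)] = new_url

    def resolve(self, name):
        return list(self.locations.get(name, []))


resolver = Resolver()
name = "urn:example:report-42"
resolver.register(name, "http://old.example.com/reports/42.html")
resolver.move(name, "http://old.example.com/reports/42.html",
              "http://new.example.com/archive/42.html")
```

<p> References that store the name rather than the URL keep working
after the move, at the cost of one extra lookup per access.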
<h3>Issues in Document Identifiers</h3>
There are a number of open design issues in the area of document
identifiers. These design issues are present for dealing with
electronic documents, whether in a library, a workgroup, or on the
Internet.
<!-- Another issue: case sensitivity, human readable vs. machine
generated, trademarks and lifetime -->
<h4>Fragments, relationships</h4>
<p> How does one identify a piece of something else? For example, if
there is a volume of collected papers, do the individual papers get
separate identifiers? If so, is the identifier for each element somehow
syntactically related to the identifier for the whole? If not, how is
the relationship established? Is there a database that links the part
to the whole?
<p> When an object is revised, does it retain its identifier? For
example, in System 33<a href="#ref23">[23]</a>, every document had two identifiers: one
that was assigned to `this version' and another that specified `the
latest version of whatever this becomes'.
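<p> A two-identifier scheme of this general kind might look like the
sketch below. It is illustrative only and is not a description of how
System 33 was implemented; the class, the identifier prefixes and the
counter-based identifiers are assumptions made for the example.

```python
import itertools


class VersionedStore:
    """Illustrative only; not a description of how System 33 worked."""

    def __init__(self):
        self._counter = itertools.count(1)
        self.versions = {}   # version id -> content (never changes)
        self.latest = {}     # lineage id -> version id of newest revision

    def _new_id(self, prefix):
        return f"{prefix}-{next(self._counter)}"

    def create(self, content):
        lineage_id = self._new_id("doc")
        version_id = self._new_id("ver")
        self.versions[version_id] = content
        self.latest[lineage_id] = version_id
        return lineage_id, version_id

    def revise(self, lineage_id, content):
        version_id = self._new_id("ver")
        self.versions[version_id] = content
        self.latest[lineage_id] = version_id
        return version_id

    def fetch(self, identifier):
        # A lineage id follows the pointer to the newest version;
        # a version id always returns the same content.
        version_id = self.latest.get(identifier, identifier)
        return self.versions[version_id]


store = VersionedStore()
doc, v1 = store.create("first draft")
v2 = store.revise(doc, "second draft")
```

<p> A citation that must stay fixed would store the version identifier,
while a link meant to follow later edits would store the lineage
identifier.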
<p> In the office environment, a document with a cover memo attached
might be considered a different object. However, in some situations,
the `cover' material is merely an external attribute, and the document
hasn't changed and should not get a different identifier.
<p> In general, there are a large number of relationships between
objects that can be expressed as relationships of the identifiers of
the objects, and relevant design decisions are currently made in
an ad hoc fashion. Publishers are allowed to retain the same ISBN
number for minor printing revisions, but the paperback and hardcover
of a book are given different ISBN numbers. On the web, the URL of a
document doesn't change if the content changes. Moreover, different
vendors' document management systems seem to take different approaches
to dealing with revision and identity.
<!-- DE: examples of problems these cause? -->
<h4>Uniqueness</h4>
There are a variety of methods used to ensure that different documents
do not get the same identifier, even when different entities are
assigning names. These methods rely either on a distributed hierarchy,
or a probabilistic method of name assignment.
<p> In a hierarchical uniqueness system, there is a tree of 'naming
authorities'. Every naming authority guarantees that it will not give
out the same identifier to two different documents. If it delegates
some of the naming authority to sub-authorities, it also delegates
that promise. ("Here, you can give out names, but you make sure you
never give out the same name twice.") For example, the Internet's
Domain Name System is a hierarchical service; the owner of
"xerox.com" can hand out unique names under that suffix and can
delegate the part of the name space beneath "parc.xerox.com" to that
subdomain's owner. Many of the proposals for URNs on the Web are
hierarchical.
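<p> The delegated promise can be sketched as a small tree of authority
objects. The class and method names are hypothetical, and the
dot-separated name layout simply mimics domain names for the sake of the
example.

```python
class NamingAuthority:
    """A node in a tree of naming authorities (hypothetical sketch)."""

    def __init__(self, name):
        self.name = name
        self.issued = set()  # labels already handed out directly below this node

    def _claim(self, label):
        if label in self.issued:
            raise ValueError(f"{label!r} already issued under {self.name!r}")
        self.issued.add(label)
        return f"{label}.{self.name}"

    def assign(self, label):
        """Issue a name to a document; never the same one twice."""
        return self._claim(label)

    def delegate(self, label):
        """Hand a sub-tree to a new authority that inherits the promise."""
        return NamingAuthority(self._claim(label))


xerox = NamingAuthority("xerox.com")
parc = xerox.delegate("parc")
first = parc.assign("report-1")
try:
    parc.assign("report-1")
    duplicate_allowed = True
except ValueError:
    duplicate_allowed = False
```

<p> Each authority checks only its own labels, so uniqueness across the
whole tree follows without any central registry of full names.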
<p>Some distributed naming systems are hierarchical but have a fixed
depth of the hierarchy. For example, ISBN numbers have three assigned
parts: a group code (the country or language area of registry for the
publisher), the publisher identifier, and, for each publisher, the
document identifier; a final check digit guards against transcription
errors. Each publisher is allowed to assign their own ISBN
numbers. Some naming systems are not distributed, but guarantee
uniqueness by keeping a single source of identifiers; for example, the
Library of Congress Control Number is assigned uniquely by the U.S.
Library of Congress.
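<p> The fixed-depth structure of an ISBN can be made concrete with a
short sketch for the ten-digit form. As printed, an ISBN also ends in a
check digit, chosen so that the weighted sum of all ten characters is
divisible by 11. The function names are invented for this example, and
the splitter assumes the ISBN is written with hyphens between fields.

```python
def split_isbn10(isbn):
    """Split a hyphenated ISBN-10 into its fields."""
    group, publisher, title, check = isbn.split("-")
    return {"group": group, "publisher": publisher, "title": title, "check": check}


def isbn10_is_valid(isbn):
    """Weighted sum of the ten characters must be divisible by 11."""
    chars = isbn.replace("-", "")
    if len(chars) != 10:
        return False
    total = 0
    for weight, ch in zip(range(10, 0, -1), chars):
        if ch == "X" and weight == 1:
            value = 10          # 'X' stands for 10 in the check position only
        elif ch in "0123456789":
            value = int(ch)
        else:
            return False
        total += weight * value
    return total % 11 == 0


parts = split_isbn10("0-306-40615-2")
```

<p> Because the group and publisher fields come first, a reader of the
identifier can tell which registry and which publisher to consult
without any lookup of the individual title.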
<p> A random naming authority is one in which names are given out
using random numbers; each authority uses enough information to make
the probability of two documents getting the same identifier quite
small. For example, some schemes use the one-way hash (MD5, SHA) of
the document as the document identifier. The LIFN system <a
href="#ref24">[24]</a> uses a randomly assigned document identifier in
this way.
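<p> Both probabilistic approaches are easy to sketch with the Python
standard library. SHA-256 is used here simply as one member of the SHA
family; the <code>sha256:</code> and <code>rnd:</code> prefixes and the
function names are assumptions of this example, not part of LIFN or any
other scheme.

```python
import hashlib
import secrets


def content_id(document_bytes):
    """Identifier derived from the content via a one-way hash."""
    return "sha256:" + hashlib.sha256(document_bytes).hexdigest()


def random_id(num_bytes=16):
    """Identifier drawn at random, independent of the content."""
    return "rnd:" + secrets.token_hex(num_bytes)


a = content_id(b"hello world")
b = content_id(b"hello world")
c = content_id(b"hello world!")
r = random_id()
```

<p> The two differ in one important way: a hash-derived identifier
changes whenever the content changes, while a random identifier can stay
attached to a document across revisions.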
<!-- Need reference for MD5, SHA? opaque vs semantic names? Examples -->
<h4>Resolution</h4>
Given a name for an object, how does one go about finding information
about that object? How much information is packed into the name? For
example, ISBN numbers give you some clue about who the publisher is,
and there is a global registry of publishers. If you can't find the
document in your catalog, you can check the publisher. On the other
hand, the random schemes give no hints. Using URLs, the identifier
contains nearly complete information to access a resource across the
global Internet. Usually, though, the more information contained in the
identifier, the harder it is for the resolution system to find
objects when they have moved.
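<p> To see how much access information a URL carries, it helps to take
one apart. The sketch below uses Python's standard
<code>urllib.parse</code> module; the URL itself is a made-up example.

```python
from urllib.parse import urlsplit

parts = urlsplit("http://www.example.com:8080/papers/docweblib.html#intro")
access = {
    "protocol": parts.scheme,    # how to talk to the server
    "host": parts.hostname,      # which machine to contact
    "port": parts.port,          # which service on that machine
    "path": parts.path,          # what to ask the server for
    "fragment": parts.fragment,  # which part of the result to show
}
```

<p> Every one of these fields ties the identifier to a particular
machine or file layout, which is why such an identifier breaks when the
object moves.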
<!-- explain some resolution systems! Examples! -->
<h3><a name="metadata">2.2 MetaData</a></h3>
In document management, digital libraries and the web, it is common to
want to record information about documents that is not part of the
documents themselves. These assertions are sometimes called `document
attributes'; sometimes they are called `metadata' to signify that they
are data about data rather than the information itself. Metadata
assists in describing, organizing, discovering and accessing
network information resources.
<h4>Metadata in Document Management</h4>
Most document management systems include mechanisms that permit at
least the system administrator to define, according to the
application, a set of attributes that are common to the documents in a
repository or at least a variety of classes of documents. For
example, many systems record the user identity of the originator of
the document, the date and time of origination, other information
external to the documents themselves, or some other attributes of the
documents in the repository, as determined by the system
administrator. A law office might index its documents by the name of
the client; a manufacturer, by the product or parts codes affected
within.
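<p> An administrator-defined attribute set can be pictured as a small
schema against which each document's attributes are checked. The schema,
attribute names and sample records below are hypothetical and follow the
law-office example only loosely.

```python
from datetime import date

# Attribute name -> required Python type; a hypothetical law-office schema.
SCHEMA = {"originator": str, "created": date, "client": str}


def validate(attributes, schema=SCHEMA):
    """Return a list of problems; an empty list means the record conforms."""
    problems = []
    for name, expected in schema.items():
        if name not in attributes:
            problems.append(f"missing attribute: {name}")
        elif not isinstance(attributes[name], expected):
            problems.append(f"wrong type for attribute: {name}")
    return problems


def find(records, **wanted):
    """Select documents by externally applied attributes, not by content."""
    return [doc_id for doc_id, attrs in records.items()
            if all(attrs.get(k) == v for k, v in wanted.items())]


records = {
    "contract-7": {"originator": "lm", "created": date(1995, 6, 9), "client": "Acme"},
    "letter-12": {"originator": "dk", "created": date(1995, 5, 2), "client": "Acme"},
    "letter-13": {"originator": "lm", "created": date(1995, 5, 3), "client": "Globex"},
}
```

<p> A call such as <code>find(records, client="Acme")</code> then
retrieves documents by client without looking inside any of them.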
<h4>Metadata in Digital Libraries</h4>
<!-- AP: you are starting to deviate from your definition of metadata.
Most of the time you mean metadata to be a description of the
(search) fields in a doc. But in this par you say that bibliographic
records are metadata. But these records have things like publication
data, number of pages, etc. That is, of course, metadata as well.
But you need to enlarge your definition to explicitly include both
purposes of metadata
-->
<p>Libraries have traditionally been quite concerned with cataloging
-- a process which associates metadata with bibliographic material.
The card catalog entries for an item in the library provide metadata
about the item. There are a variety of standards used for online
cataloging. The most prominent is USMARC. Various attempts have been
made to extend and enhance USMARC to deal with online material<a
href="#ref25">[25]</a><a href="#ref26">[26]</a>. The Z39.50
standard contains extensive mechanisms for communicating
both search parameters (requested metadata) and document
attributes (output metadata).
More recently, attempts to define online document standards for the
humanities arrived at a standard set of metadata for humanities
texts<a href="#ref28">[28]</a>.
<!-- band: for the time being -- furthermore these are issues of local
interpretations of standards -->
<h4>Metadata on the Internet</h4>
The Internet community has several efforts to define a set of metadata
tags useful for information on the network. For example, the Internet
Anonymous FTP Archives working group of the Internet Engineering Task
Force attempted to set a standard for describing FTP-accessible data<a
href="#ref29">[29]</a>. In fact, one could think of the standard
headers of an Internet electronic mail message as identifying
attributes for each message<a href="#ref30">[30]</a>. Every Internet
message has required attributes; for example, it must identify who it
is "From" and "To" and the "Date" it was sent. In addition, there are
optional attributes, such as "Subject" and "Comments". There are
rules that specify the kinds of values each attribute can have.
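<p> The view of mail headers as attributes can be sketched with
Python's standard <code>email</code> package. The message below is
made up for illustration; the sketch only shows that the header fields
can be read as an attribute-value description that is separate from
the body.

```python
from email import message_from_string

# Hypothetical message, used only for illustration.
raw = """\
From: alice@example.org
To: bob@example.org
Date: Mon, 06 Nov 1995 10:15:00 -0800
Subject: Draft of section 2.2

The body is the data; the headers above are data about it.
"""

msg = message_from_string(raw)

# Treat the header fields as an attribute-value description of the message.
attributes = dict(msg.items())
for name in ("From", "To", "Date"):
    print(name, "=", attributes[name])

# Optional attributes may simply be absent.
print("Comments" in attributes)
```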
<p> The Uniform Resource Identifier working group<a href="#ref31"
>[31]</a> has been trying to develop a standard syntax and
representation for information citations, in a scheme called Uniform
Resource Citations (URCs). A URC describes information on the
Internet: it offers a way to discover more about a resource
referenced by a URL or URN before retrieving the item, as well as a
way of cataloging Internet information.
<h3>Issues in Metadata</h3>
<p> There are a number of design issues in representing metadata for
online information, some semantic (what does it mean and how do you
say it?), some structural (does metadata have structure?) and some
syntactic (how do the semantics and structure get represented as a
sequence of characters or bytes?). These issues span the three
application areas.
<h4>Semantic issues</h4>
<p> Are there well known attributes? MARC takes a strong stand: MARC
defines a set of well-known attributes with descriptions of each. Some
of them take on values within a controlled vocabulary. There are
standards for the completeness and quality of a catalog entry. The
set of attributes is defined and used universally by nearly all online
library catalogs. In document management systems, on the other hand,
the system administrator for a workgroup generally establishes
conventions for the attributes used and what they mean. When multiple
document management systems are brought together, though, combining
the semantics of the disparate sources is a serious problem. The
Internet community is struggling with standardization of semantics for
attribute sets. While there are some attributes that are well-known
(content attributions in mail messages, mapping to ISO protocols in
X.400), these are by no means universal.
<p> If there is not a single well-known set of attributes that spans
all known objects, then it is still possible to create a system of
<dfn>entities</dfn> -- classes of documents which share the same
schema of attributes. For each class, the attribute set can then be
defined. For example, a document management system might allow for
'memo' and 'spreadsheet' and 'expense report'. Every memo might be
catalogued by its distribution list, while an expense report might be
required to have a budget center and a signature status. More complex
schema systems allow for inheritance and specialization of classes, as
is found in object-oriented programming. There are variations among
different implementations, just as there are in different
object-oriented programming systems.
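<p> As a minimal sketch of such a class-based schema, the Python
fragment below models the memo and expense report mentioned above,
with the shared attributes inherited from a common base class. The
class and attribute names are assumptions made for illustration and
are not taken from any particular document management system.

```python
from dataclasses import dataclass, fields

@dataclass
class Document:
    # Attributes shared by every class of document in the repository.
    originator: str
    created: str

@dataclass
class Memo(Document):
    # A memo is additionally cataloged by its distribution list.
    distribution_list: list

@dataclass
class ExpenseReport(Document):
    # An expense report must carry a budget center and a signature status.
    budget_center: str
    signature_status: str

def schema(cls):
    """Return the attribute names that make up a class's schema."""
    return [f.name for f in fields(cls)]

print(schema(Memo))
print(schema(ExpenseReport))
```

<p> Each subclass inherits the common attributes and adds its own, so
a search across classes can rely only on the attributes of the shared
base.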
<!-- examples of choices made by various systems. band: the solutions
are not without serious drawbacks.
Issue is reconciliation in merged searches.
What are the overall conclusions and concerns one should
draw from this section?
-->
<h4>Structural issues</h4>
<p> Frequently it is difficult to tell the `boundaries' of an online
electronic work. If one describes a site's `home page', does the
description apply to the site, or just to the introductory `splash
page'? If an object contains parts, do the parts have separate
attributes? For example, if a report in a document management system
has a cover memo, in what way are the author of the report and the
author of the cover memo distinguished or reported in the description
of the overall object?
<!-- band: locally constructed, situated -->
<p> Metadata itself can also have structure. It is sometimes necessary
and occasionally critical to know the author of an attribute or the
time when the attribute was assigned. If metadata itself can be
updated and revised, then the history of its editing may be of
relevance. How does one distinguish among `the title', `the
title, translated into French', and `the title, translated into
English from Italian by D. H. Lawrence'? The relationships between
elements of the metadata are problematic for some flat attribute-value
representation schemes like MARC.
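<p> The difficulty can be illustrated with a small Python sketch. It
compares a deliberately simplistic flat view (one value per attribute
name) with a structured one in which each assertion carries its own
qualifiers. The record contents, field names and cataloger identifier
are placeholders, and the flat view is not meant to model any real
cataloging format.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Assertion:
    """One metadata statement, with qualifiers a flat scheme cannot hold."""
    attribute: str
    value: str
    language: Optional[str] = None         # language of the value
    translated_from: Optional[str] = None  # source language, if a translation
    translator: Optional[str] = None
    asserted_by: Optional[str] = None      # who assigned this attribute
    asserted_on: Optional[str] = None      # when it was assigned

# Hypothetical record; titles and names are placeholders.
record = [
    Assertion("title", "Titolo originale", language="it"),
    Assertion("title", "Titre traduit", language="fr", translated_from="it"),
    Assertion("title", "Translated title", language="en",
              translated_from="it", translator="D. H. Lawrence",
              asserted_by="cataloger-17", asserted_on="1995-11-06"),
]

# The flat view keeps one value per attribute name, so the three titles
# collapse into one and all qualifiers are lost.
flat = {a.attribute: a.value for a in record}
print(flat)

# The structured view can still answer qualified questions.
english = [a.value for a in record if a.language == "en"]
print(english)
```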
<!-- note that no system has done much with metadata structure
band: how depends on who and where.
Elliot: more on the last sentence: flat attribute-value
problems.
-->
<h4>Syntactic and system issues with metadata</h4>
<p> While it might seem straightforward, standardization of the
syntactic mechanisms for representing the semantics and structure of
attributes is quite difficult. Attributes might have a fixed,
extensible, or uncontrolled set of values, and the mechanisms for
assigning the allowable elements of a controlled set are difficult
to establish. In addition, each attribute or field might need to deal
with alternative syntaxes (e.g., for names, is it last name first or
given name first?), multiple character sets (names in Chinese or
Arabic), or even non-textual data.
<!-- ASN.1 vs textual encoding debates.
DE says: so what should be done?
-->
<h3><a name="aaa">
2.3 Authentication, Authorization, Accounting (AAA) and Related Issues
</a></h3>
There are several related issues having to do with security, rights,
privacy, confidentiality and access that arise in all of the
application areas. Authentication is the process by which the
identity of a person (or system) is ascertained and assured.
Authorization is the process of determining whether a given operation
is allowed, such as reading a document or updating metadata.
Accounting is the process of recording operations and the payment due
for them. An audit trail of records of past operations might be kept,
as a way of checking the integrity of the system.
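<p> The three processes can be kept apart in code as well as in
definition. The Python sketch below is illustrative only: the
in-memory tables, user names and document names are hypothetical, and
a real system would use a salted, deliberately slow password hash
rather than a bare SHA-256 digest.

```python
import hashlib
import hmac

# Hypothetical in-memory tables standing in for a real user database,
# access control list and audit log.
PASSWORD_HASHES = {"alice": hashlib.sha256(b"s3cret").hexdigest()}
PERMISSIONS = {("alice", "report-42"): {"read"}}
AUDIT_LOG = []

def authenticate(user, password):
    """Authentication: is this really the user they claim to be?"""
    expected = PASSWORD_HASHES.get(user)
    supplied = hashlib.sha256(password.encode()).hexdigest()
    return expected is not None and hmac.compare_digest(expected, supplied)

def authorize(user, operation, document):
    """Authorization: may this user perform this operation?"""
    return operation in PERMISSIONS.get((user, document), set())

def account(user, operation, document, allowed, fee=0):
    """Accounting: record the operation and any payment due."""
    AUDIT_LOG.append((user, operation, document, allowed, fee))

def request(user, password, operation, document):
    """Run all three steps for one requested operation."""
    allowed = (authenticate(user, password)
               and authorize(user, operation, document))
    account(user, operation, document, allowed)
    return allowed

print(request("alice", "s3cret", "read", "report-42"))
print(request("alice", "s3cret", "update", "report-42"))
print(len(AUDIT_LOG))
```

<p> Denied requests are logged as well as granted ones, so the audit
trail records attempts and not only successes.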
<h4>AAA in Document Management</h4>
In document management systems, the critical elements of AAA are
concerned with managing the permissions to access the information in a
set of documents and maintaining the integrity of these documents.
Some documents are confidential, others are public, others belong to
particular workgroups. Most of the early work in authorization
followed the military model of classified information and clearance
levels; this model has been found to be inappropriate for many
non-military applications.
<!-- AP: Give reference for this assertion -->
Frequently, the authorization system of the document management system
is inadequate to represent and enforce the company's access control
needs; for example, the actual work practice in many organizations
will relax rules and guidelines in specific situations.
<!-- what are the alternatives? -->
<p> Despite the more complex needs, some document management systems
rely on either their database manager or the host network operating
system to provide authentication and access control, if for no other
reason than to avoid providing separate authentication and
administrative domains.
<!-- DE: providing common authentication & access control across
NOSes would be a major engineering effort, especially given
differences in the security models of different NOSes. -->
<h4>AAA in Digital Libraries</h4>
In the library setting, the requirements for AAA often focus on
copyright, payment methods, and usage rights; in addition, there is a
significant concern for the privacy of the reader and information
about what is being read by whom. The situation is made more complex
by the difficulties in interpreting copyright law originally designed
for physical material in a world of electronic reproduction and
distribution. In many countries, copyright law and the practice around
it are being reexamined in the age of electronic distribution. In any
case, it is clear that digital library systems will need to address
issues of copyright and intellectual property rights before they can
be widely deployed.
<!-- AP:
In 'AAA in Digital Libraries', you might need to introduce payment as
being relevant as well. For-pay information will probably be part of the
picture.
-->
<h4>AAA in the Web</h4>
<p>The Internet community has a large number of separate efforts
defining security standards. The web community is exploring two
systems, Secure HTTP (S-HTTP)<a href="#ref32">[32]</a> and Secure
Socket Layer (SSL)<a href="#ref33" >[33]</a>. S-HTTP is a modification
of the HTTP web protocol that includes security features. SSL is an
application-independent protocol for negotiating secure network
communication. Recently these efforts have joined forces. In
addition, new authentication mechanisms for web access (other than
simple passwords) are being proposed using Digest Access
Authentication<a href="#ref34" >[34]</a> and Multi-party Digest
Authentication<a href="#refXX" >[XX]</a>.
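<p> As a rough illustration of how a digest scheme avoids sending the
password itself, the Python sketch below computes a response from the
user name, realm, password, request method, URI and a server-supplied
nonce, using MD5 in the arrangement in which the scheme is commonly
published. It is an assumption that this arrangement matches the
proposal cited above; the credentials and nonce are made up, and
optional fields used in later revisions are omitted.

```python
import hashlib

def md5_hex(text):
    """Hex-encoded MD5 digest of a string."""
    return hashlib.md5(text.encode("utf-8")).hexdigest()

def digest_response(username, realm, password, method, uri, nonce):
    # Digest of the secret part: never sent over the network.
    ha1 = md5_hex(f"{username}:{realm}:{password}")
    # Digest of the request being made.
    ha2 = md5_hex(f"{method}:{uri}")
    # The value the client sends; it depends on the server's nonce, so a
    # captured response cannot simply be replayed with a fresh nonce.
    return md5_hex(f"{ha1}:{nonce}:{ha2}")

# Hypothetical values, used only for illustration.
response = digest_response("alice", "library@example.org", "s3cret",
                           "GET", "/reports/weblib.html", "abc123nonce")
print(response)
```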
<p>In addition, the Internet mail community has produced two
complementary systems for secure electronic mail, Pretty Good Privacy
(PGP)<a href="#ref36">[36]</a> and Privacy Enhanced Mail (PEM)<a
href="#ref37">[37]</a>. PGP is a public key cryptosystem with a number
of utilities for dealing with keys and mail. PEM is a system for
providing privacy enhancement services (confidentiality,
authentication, message integrity assurance and non-repudiation of
origin) using either symmetric (secret-key) or asymmetric
(public-key) approaches for encryption of data-encrypting keys.
<!-- AP:
Assumption of reader sophistication is different for this phrase than for
the other material. -->
There is some hope that all of these separate efforts will eventually
converge.
<p>Beyond the mechanisms for dealing with security, copyright and
intellectual property, the Web is capable of providing for spontaneous
financial transactions. A number of mechanisms for handling payment
and billing are being explored, either through credit card settlement
methods or digital cash<a href="#ref38">[38]</a>.
<!-- AP: Spontaneous sounds weird. What does this mean? -->
<p>The most serious issue is the design of an authorization scheme
that will scale to the size of 'all users on the Internet', given the
enormous international scope of the Internet and the wide variety of
needs and policies requiring support.
<p> Finally, US export control laws that govern the export of
cryptographic software have been perceived as a difficult impediment
to widespread deployment of secure software solutions to the Web's
problems.
<h4>AAA Issues</h4>
It is clear that the main issues in each domain (intellectual property
in libraries, complex authorization needs in document management, and
secure communication for spontaneous transactions on the web) will
also become important in the others. In particular, as enterprises
grow their document management needs, the need for cross-domain
authentication mechanisms grows. Likewise, the web will need richer
methods for expressing access control and authorization than most web
services currently provide.
<p> One common issue in all of the systems is detecting the boundary
of the item to which a particular authorization might apply. Access
control and authorization might need to apply to a different
granularity of object than is denoted with a single identifier.
<p>In general, one of the most troubling elements of AAA design is
that it is difficult to retrofit security in an architecture that
doesn't already have it. The analysis of likely threats often requires
revisiting optimizations made for performance reasons. For example, a
design which employs distribution and caching of documents close to
the site of access for performance reasons needs to account for the
risks embodied in having a repository of cached documents which might
be compromised.
<h3><a name="types">
2.4 Document types
</a></h3>
<!-- AP: Document types vs formats, what's the difference? -->
Generally, digital libraries, document management and the web manage
documents and not files. The units of communication, the items being
stored and retrieved, are representations of intellectual content, not
merely sequences of bytes. However, documents are <i>represented</i>
as one or more sequences of bytes in a file system. The representation
is tagged with an indication of what kind of object it is. This
labeling is itself an issue in each area.
<h4>Document types in Document Management Systems</h4>
<p> Individual vendors of document management systems have frequently
created their own ad hoc registries, to allow their systems
to deal with multiple document types in a consistent way. More recent
work in the electronic mail vendors association and the ODMA group has
created registries of well-known document types. Most generally,
though, document management systems restrict themselves to dealing
with the document types that either are common in desktop applications
in the workplace or else are registered by the system administrator of
the document management system.
<h4>Document types in Digital Libraries</h4>
<p> The range of kinds of media and digital objects that potentially
might be stored in a digital library is enormous. Currently, most
attempts to catalog material have used fairly ad hoc descriptions of
the files and their formats. A critical issue in the library
community, though, is <em>preservation</em><a href="#ref39">[39]</a><a
href="#ref40">[40]</a>. It is important to make sure recorded material
will be available in 10, 20, or 100 years. This is an issue not only
of the longevity of the storage medium (which can be mitigated by
refreshing the media), but, more importantly, the longevity of any
particular storage representation. If one were to preserve a file
that was created with Microsoft Word in 1995, for how long can one
expect a Microsoft Word-capable reader to remain available?<a href="#ref39"
>[39]</a>
<!-- DE: So, what solutions are there -->
<h4>Document types in the web</h4>
<p> The method for indicating the media type of an object in the
Internet arose from work on MIME: the Multipurpose Internet Mail
Extensions standard. MIME extended Internet electronic mail -- formerly
confined to the interchange of ASCII text -- by allowing for a rich
representation of objects and object types. The MIME standard allows
for the labeling of an object by its media type. Media types are
defined as a two part name (e.g., "text/html" or
"application/postscript") along with optional parameters. Media types
are categorized into several top-level types ("text", "image",
"audio", "application", "multipart") and then, within each top-level
type, an extensible set of subtypes. Each type can also define
parameters; for example, "text" types can have a "charset" parameter
where the character encoding used for the text is given. There is a
formal process for defining new media types, where information about
the type and required and allowed parameters are supplied.
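<p> The two-part name and its parameters can be taken apart with
Python's standard <code>email</code> package, as in the short sketch
below. The header value is a made-up example.

```python
from email.message import Message

# Hypothetical header value, used only for illustration.
msg = Message()
msg["Content-Type"] = 'text/html; charset="iso-8859-1"'

media_type = msg.get_content_type()     # full two-part name
top_level = msg.get_content_maintype()  # top-level type
subtype = msg.get_content_subtype()     # subtype within that top-level type
charset = msg.get_param("charset")      # optional parameter

print(media_type, top_level, subtype, charset)
```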
<h3>Issues in Document Types</h3>
There are difficulties with the current mechanisms used for specifying
document types that are common in all of the application areas, and
affect the long-term interoperability and capability of the typing
system.
<h4>Type attributes</h4>
In many scenarios of use, it is important for one system element to be
able to interrogate across the network the type of a digital object to
determine if the local system is capable of processing or rendering
the object. For example, a reading machine might not bother to
retrieve an image-rich rendition of a document, but prefer one with
structural markup. In some cases, the coarse denotation of 'image' is
not sufficient; for example, it is important to note externally
whether the image is color or black-and-white, its resolution or other
attributes. Text documents may need to be annotated with a description
of the character encoding employed or the fonts used. These sub-type
attributions are difficult to deal with in many document type
definition systems.
<P> A related problem is that many document types are merely
references to specifications that are evolving over time. For example,
when the "application/postscript" type was originally proposed, there
was one version of PostScript. Now, there are two levels. The GIF
specification for images has two versions and a third under
development. A system element might be able to deal with some versions
and not others. Many type specification systems do not explicitly
allow for versioning.
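<p> One way to picture what finer-grained labeling would permit is the
Python sketch below, in which a client decides from the labels alone
which renditions it can process. The attribute names
(<code>level</code>, <code>version</code>, <code>color</code>) and the
capability table are assumptions made for illustration; they do not
correspond to parameters registered for these media types.

```python
# Hypothetical descriptions of the renditions a server offers.
renditions = [
    {"type": "application/postscript", "level": 2},
    {"type": "image/gif", "version": "89a", "color": True},
    {"type": "text/html"},
]

# Hypothetical client capabilities: for each type handled, the attribute
# values the client can cope with (an empty dict means no constraints).
capabilities = {
    "application/postscript": {"level": {1}},
    "text/html": {},
}

def can_process(rendition, capabilities):
    """True if the type is known and every constrained attribute matches."""
    constraints = capabilities.get(rendition["type"])
    if constraints is None:
        return False
    return all(rendition.get(attr) in allowed
               for attr, allowed in constraints.items())

usable = [r["type"] for r in renditions if can_process(r, capabilities)]
print(usable)
```

<p> With only the coarse type name available, this client would have
fetched the level 2 PostScript file and failed to render it.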
<h4>Resources used</h4>
Many representations of documents implicitly rely on external
resources to actually define the interpretation of the file(s) that
comprise the document. Thus, a PostScript file also requires the
definitions of the fonts that it names; a TeX or nroff file also
requires the definitions of any macro packages it invokes. These
resource definitions are often assumed implicitly in the environment
rather than being called out separately. In the case where one wishes
to externally identify the media type, it may be necessary to also
name the resources assumed in a more explicit manner.
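<p> As a sketch of what naming the assumed resources might look like,
the Python fragment below scans a small TeX source for the class,
packages and input files it invokes and attaches them to a type label.
The source text, the label layout and the
<code>application/x-tex</code> type name are assumptions for
illustration, and the pattern is deliberately simple: it ignores
optional arguments, comments and comma-separated package lists.

```python
import re

# Hypothetical TeX source, used only for illustration.
source = r"""
\documentclass{article}
\usepackage{graphicx}
\usepackage{times}
\input{macros}
\begin{document}
Hello.
\end{document}
"""

# Match the three commands of interest followed by a braced name.
pattern = re.compile(r"\\(documentclass|usepackage|input)\{([^}]*)\}")

resources = [(kind, name) for kind, name in pattern.findall(source)]

# An explicit label: the type plus the external resources it relies on.
label = {"type": "application/x-tex", "requires": resources}
print(label)
```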
<h4>Preservation</h4>
The issues of preservation in the library community are of growing
concern in other areas. Companies with large repositories of
electronic documents are discovering that they have great difficulty
accessing them over time, not just because the storage media have
become obsolete, but also because conversion of old document formats
to new is difficult, unreliable and time-consuming.
<!-- AP: - [The 'Preservation' section seems redundant with other material in the
paper] -->
<h4>Open vs. Proprietary</h4>
In a number of cases, the definition of the document type is not
available outside the package that produces the type. While it might
be reasonable within a limited context to define a document as
being a `WordPerfect file', without a preserved specification of the
actual interpretation of WordPerfect files, this labeling may not be
useful decades hence. This is especially true because over time, there
may be many different versions, configurations for multiple platforms,
or localizations for different countries.
<h4>Compound objects</h4>
In many cases, the object being cataloged, manipulated and described
is a compound object: a sequential concatenation, a collection of
independent documents, or a compound object with some items nested
inside or referenced from others. Any system for externally labeling
and describing the types of the objects in use must be able to
express the types of such compound objects.
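<p> MIME's "multipart" types are one existing way to express the type
of a compound. The Python sketch below builds a hypothetical compound
object (a cover memo with an attached report) using the standard
<code>email</code> package and then lists the type of the whole and of
each nested part. The contents and file name are placeholders.

```python
from email.message import EmailMessage

# Hypothetical compound document: a cover memo plus an attached report.
doc = EmailMessage()
doc["Subject"] = "Quarterly report"
doc.set_content("Cover memo: please review the attached report.")
doc.add_attachment(b"\x00\x01placeholder-bytes", maintype="application",
                   subtype="octet-stream", filename="report.bin")

def type_tree(part, depth=0):
    """Return (depth, media type) pairs for a part and all nested parts."""
    rows = [(depth, part.get_content_type())]
    if part.is_multipart():
        for child in part.iter_parts():
            rows.extend(type_tree(child, depth + 1))
    return rows

for depth, media_type in type_tree(doc):
    print("  " * depth + media_type)
```

<p> The outer label says only that the object is a compound; the types
of the parts have to be read from the parts themselves.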