(2:00 pm, Friday)
George R. Thoma, Ph.D.
National Library of Medicine
Bethesda, Maryland
Katherine F. Willis, Ph.D.
University of Michigan
Anita Wagner, BA, MS
Health Sciences Libraries Consortium
Frank L. Walker
National Library of Medicine
Bethesda, Maryland
Panel: Access to Document Images over the Internet
ABSTRACT
Electronic access to bibliographic and full text databases has been routinely done for many years, but the electronic retrieval of
complete documents, in particular journal articles, is rare even
today. This panel addresses the issues in accessing stores of
documents, both in electronic form in image databases as well as in
paper form that is converted to images at the point of request.
PAPER
George R. Thoma and Frank L. Walker
National Library of Medicine
Bethesda, Maryland
DocView: Documents to the End User's Desktop
Problem addressed
Electronic access to bibliographic and full-text databases has been routinely done for many years, but the electronic retrieval of complete documents, journal articles for example, is rare even today. The DocView project addresses the issues in providing end users access to electronic bitmapped documents over the Internet, both from image databases as well as from workstations (e.g., Ariel) that scan paper documents and transmit the images. Once the user receives the document images on the desktop, he or she may preview the pages on the screen, manipulate the image (zoom, scroll, pan), cut and paste portions of pages of interest, electronically "bookmark" desired pages, and print only the pages needed.
What is DocView?
Intended to provide end users document images via the Internet, the DocView prototype system is a client/server application consisting of Microsoft Windows-based client software at the user's machine and remotely located Unix-based document image servers. These servers use both inhouse-designed as well as public domain software, such as World Wide Web, FTP and Gopher. Most importantly, DocView delivers document images over the Internet, offering a faster, cheaper, more reliable and more convenient method of document delivery than either fax or mail, since the Internet offers higher speed, higher image resolution and lower transmission cost than the public telephone system. These advantages promise to become even more pronounced as the backbone speed of the Internet, currently at T3 (about 45 Mbps), moves up gradually to OC-3, OC-12 and eventually to Gigabits/second speeds.
Another way an end user may use DocView is to receive document images sent over the Internet from remotely located Ariel workstations. Ariel is a software package developed by Research Libraries Group for a workstation comprising a PC, a scanner and printer.
Many libraries are beginning to use Ariel for document transmission in a "fax-like" manner, but via Internet. The DocView client software is designed to receive Ariel-compatible documents and allow the same document image manipulation.
Whichever method is used, the end user may be physically separated from image servers or Ariel workstations by thousands of miles. The first method, client/server, allows the user to browse through a list of available documents such as journal articles, to select one and receive it immediately. The second method is less direct in that a user may contact a library, ask for a specific article, then have it sent directly to the user's computer. A blinking icon alerts the user to the arrival of a document from an Ariel station. While the first method promises rapid delivery of pre-scanned documents from an online document collection, the second method provides documents on demand, albeit more slowly, and does not require documents to be preselected and stored. DocView is designed to allow a user to gain access to documents through both means simultaneously.
In either case, the user may preview the pages on the screen, manipulate the images (zoom, scroll, pan), cut and paste portions of pages of interest, electronically "bookmark" desired pages, and print only the pages needed.
DocView client: Features
The prototype DocView client software, an application that runs under Microsoft Windows version 3.1, requires at least a 386-based computer running at 33 MHz or faster, a minimum of 8 Megabytes of memory, and an Internet connection. The user's software must include Windows Sockets, supplied by the manufacturer of the TCP/IP protocol stack used in the computer. Most major manufacturers of TCP/IP stacks for personal computers now supply a Windows Sockets dynamic link library for Windows Sockets applications.
The DocView client software allows the user to receive document images either actively or passively. In the active mode, the user employs a dialog box to select a server on Internet, connect to it, choose a document and download it. Once the server is selected, the DocView client connects to the server automatically. A citation to the first document stored at the server is displayed in the dialog box. By using the First Doc, Last Doc, Prev Doc and Next Doc buttons, the user may browse the documents available at the server. The user may download either the first page of any document to preview it, or the entire document. When transmission is completed, DocView notifies the user through an optional audible signal.
In the passive mode, an operator of a remote Ariel workstation scans a document and sends it to the DocView client. This would normally be in response to a user request made over email, telephone, fax or other electronic means. When the document from Ariel is received, DocView notifies the user with a blinking icon.
Whether a DocView user receives documents from a server or from an Ariel workstation, the method for viewing and manipulating the images is the same. Documents are displayed in separate windows on the screen. The windows may be cascaded, tiled or minimized using the Windows Multiple Document Interface. The user interface contains menus along the top of the screen and a toolbox of buttons representing the most commonly used functions in the menu. Each window may be maximized so that the page image may be easily read on the screen. The minimum recommended screen resolution is VGA, or 640x480. Higher resolutions allow more of the page to be displayed at a readable level. It is possible to zoom in on the image or to shrink it. Scroll bars are available for panning and scrolling zoomed images. The user may also rotate images in 90 degree steps, a useful function for viewing pages printed in landscape rather than portrait mode, or for pages scanned upside down which sometimes happens at Ariel stations.
The DocView user interface provides functions for easily browsing through the electronic document. It has a Next Page button, Previous Page button, and a Page Jump Button. The Page Jump function allows the user to move to any arbitrary page. DocView also contains an electronic bookmark function. Electronic bookmarks do for the electronic document what real bookmarks do for paper documents: they keep track of important sections and allow easy movement from one important section to another. Any number of pages may be marked. When marked, a page appears to have the upper right corner bent down. The Page Jump function allows movement between marked pages by allowing the user to move to the First Marked Page, Last Marked Page, Previous Marked Page, or Next Marked Page.
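The marked-page navigation described above can be illustrated with a small sketch. This is not DocView's implementation, which the paper does not show; the class name Bookmarks and its method names are hypothetical, and page numbers are assumed to start at 1.

```python
from bisect import bisect_left, bisect_right


class Bookmarks:
    """Hypothetical model of the marked pages of one document (pages are 1-based)."""

    def __init__(self, page_count):
        self.page_count = page_count
        self._marked = []  # marked page numbers, kept sorted

    def toggle(self, page):
        """Mark the page if it is unmarked, otherwise remove the mark."""
        if not 1 <= page <= self.page_count:
            raise ValueError(f"page {page} is outside 1..{self.page_count}")
        i = bisect_left(self._marked, page)
        if i < len(self._marked) and self._marked[i] == page:
            del self._marked[i]
        else:
            self._marked.insert(i, page)

    def first(self):
        """First Marked Page, or None if nothing is marked."""
        return self._marked[0] if self._marked else None

    def last(self):
        """Last Marked Page, or None if nothing is marked."""
        return self._marked[-1] if self._marked else None

    def next_after(self, current):
        """Next Marked Page strictly after the current page, or None."""
        i = bisect_right(self._marked, current)
        return self._marked[i] if i < len(self._marked) else None

    def previous_before(self, current):
        """Previous Marked Page strictly before the current page, or None."""
        i = bisect_left(self._marked, current)
        return self._marked[i - 1] if i > 0 else None


if __name__ == "__main__":
    marks = Bookmarks(page_count=10)
    for page in (3, 7, 5):
        marks.toggle(page)
    print(marks.first(), marks.last(), marks.next_after(3), marks.previous_before(5))
```

Keeping the marked pages in a sorted list makes each of the four jump operations a binary search, and "any number of pages may be marked" is bounded only by the page count.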
DocView contains a versatile print function that allows the user to print pages either in the current document or all documents. This function offers a number of options, including printing the cover page, currently displayed page, marked pages, all pages or a sequence of pages. The number of copies may vary from 1 to 10, and each page may be shrunk down to 75% in size.
Another feature is a copy function that permits the user to select part of an image and copy it to the Windows clipboard. This allows the user to create new documents from the ones received through the Internet. The portion of the image copied may be automatically scaled larger or smaller, or kept at screen resolution.
Finally, the user may obtain context-sensitive help on any function by pressing the help button associated with it. DocView offers a complete help facility covering every aspect of operating the user interface.
Server alternatives
The server technologies implemented to access the bitmapped document images include public domain software such as Gopher, World Wide Web (WWW), and FTP. In this case, Mosaic (now used by more than a million users on the Internet) would provide the connection to any of these remote servers, while DocView acts as a document viewer.
In addition to these servers, prototype server software for DocView running on Unix machines has been designed with a view to investigating improved access. Three platforms are currently being used: a Sun SPARCServer 690 located inhouse, a Sun machine at the University of Arizona, and a Convex supercomputer on the NIH campus. The software, written in C, runs unmodified on all of these computers. However, DocView could be designed to use any computer and operating system for the Internet document server. For instance, the server could be built on a Windows NT machine, or it could run on a VAX. The only requirement is that the server have access to Internet.
All of these servers have access to an experimental set of medical journal articles, which have been previously scanned at 300 dpi, compressed with CCITT Group 4, and stored as TIFF images. A typical compressed page image is about 100 kilobytes, and a typical article is ten pages long, resulting in about 1 Megabyte per article.
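The storage figures above can be checked with a short calculation. The sketch below takes the per-page size, article length and scan resolution from the text and the T3 rate (about 45 Mbps) quoted earlier. The US Letter page size (8.5 x 11 inches), the decimal reading of "kilobyte", and the function names are assumptions made only for illustration; the transfer time ignores protocol overhead and congestion, so it is a lower bound rather than a measurement.

```python
PAGE_BYTES = 100_000             # typical compressed page, from the text (decimal kilobytes assumed)
PAGES_PER_ARTICLE = 10           # typical article length, from the text
SCAN_DPI = 300                   # scan resolution, from the text
T3_BITS_PER_SECOND = 45_000_000  # backbone rate quoted earlier in the paper


def article_bytes(pages=PAGES_PER_ARTICLE, page_bytes=PAGE_BYTES):
    """Compressed size of one article."""
    return pages * page_bytes


def raw_bitonal_bytes(width_in, height_in, dpi=SCAN_DPI):
    """Uncompressed size of a 1-bit-per-pixel scan of a page of the given size."""
    pixels = round(width_in * dpi) * round(height_in * dpi)
    return (pixels + 7) // 8


def ideal_transfer_seconds(size_bytes, bits_per_second):
    """Transfer time with no protocol overhead or congestion (a lower bound)."""
    return size_bytes * 8 / bits_per_second


if __name__ == "__main__":
    raw = raw_bitonal_bytes(8.5, 11)  # ASSUMPTION: US Letter page
    print(f"raw page: {raw} bytes, compression about {raw / PAGE_BYTES:.1f}:1")
    size = article_bytes()
    print(f"article: {size} bytes")
    print(f"ideal time at T3: {ideal_transfer_seconds(size, T3_BITS_PER_SECOND):.3f} s")
```

Under these assumptions an uncompressed page is roughly 1 Megabyte, so the quoted 100 kilobytes per page corresponds to a compression ratio of about ten to one.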
Internet communication is done using Berkeley sockets. This technique uses TCP/IP connection-based stream sockets that provide a bi-directional, reliable, sequenced and unduplicated flow of data between a server and a client over the Internet. A client may send six types of queries to the server, which may return five types of responses. The types of query/response packets sent between server and client and a description of typical sessions appear in a recent paper [ref.1]. Future expansion of the client/server protocol could include provisions for the exchange of cost information and credit card numbers to allow the DocView user to pay for the requested document.
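A minimal sketch of a query/response exchange over a TCP stream socket follows. It is not the DocView protocol: the actual six query types and five response types are described in [ref.1] and are not reproduced here. The text commands CITATION and PAGECOUNT, the OK/ERROR replies, the placeholder document store and all function names are invented for illustration, and Python's socket module stands in for the C Berkeley sockets API.

```python
import socket
import threading

# Hypothetical in-memory store; a real server would hold TIFF page images on disk.
DOCUMENTS = {1: {"citation": "Placeholder citation", "pages": [b"page-1", b"page-2"]}}


def handle_connection(conn):
    """Answer newline-terminated text queries on one connection (invented protocol)."""
    with conn, conn.makefile("rb") as reader:
        for raw in reader:
            words = raw.decode("ascii", "replace").split()
            reply = b"ERROR\n"
            if len(words) == 2 and words[1].isdigit():
                doc = DOCUMENTS.get(int(words[1]))
                if doc and words[0] == "CITATION":
                    reply = b"OK " + doc["citation"].encode("ascii") + b"\n"
                elif doc and words[0] == "PAGECOUNT":
                    reply = b"OK " + str(len(doc["pages"])).encode("ascii") + b"\n"
            conn.sendall(reply)


def start_server():
    """Listen on an ephemeral local port, serve a single client, return the address."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))
    listener.listen(1)
    address = listener.getsockname()

    def serve_once():
        conn, _ = listener.accept()
        handle_connection(conn)
        listener.close()

    threading.Thread(target=serve_once, daemon=True).start()
    return address


def query(address, *requests):
    """Send each request over one connection and collect the one-line replies."""
    replies = []
    with socket.create_connection(address) as sock, sock.makefile("rb") as reader:
        for request in requests:
            sock.sendall(request.encode("ascii") + b"\n")
            replies.append(reader.readline().decode("ascii").rstrip("\n"))
    return replies


if __name__ == "__main__":
    print(query(start_server(), "CITATION 1", "PAGECOUNT 1", "CITATION 99"))
```

Because a stream socket delivers bytes reliably and in order, the client can send several queries over one connection and match replies to requests simply by position.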
Current Status
The dependencies of earlier DocView prototypes on commercial hardware and software are being eliminated to produce an entirely inhouse product. Following this development and an inhouse alpha testing stage, DocView will be beta tested in appropriate environments.
The basic evaluation goals are to investigate system performance, image quality, cost and user satisfaction with the features provided. It is of interest, for example, to find out if all the features provided are useful and desirable, and whether other functions are needed, e.g., OCR of received images to create text data, and whether this would be useful to conduct full-text searches or to append to the user's word processed documents.
Among the questions to be addressed are the following:
Does direct access to an electronic document store and online retrieval of documents over the Internet result in time and cost savings? By how much?
What are the technical specifications of the hardware and software for an affordable direct document access system?
What functions/features of the user interface contribute to easy document usage, speed of task completion, and overall user satisfaction?
What strategy should be used to select the part of the collection to be stored electronically and the part that should be left to on-demand Ariel transmission?
What strategy should be used to scan documents for electronic storage at a resolution of 300 dpi vs. 400 dpi?
Are some parts of an electronic collection better stored on magnetic disk RAIDs rather than in optical disk jukeboxes?
What are the network problems on LANs and the Internet?
Reference
1. Walker FL, Thoma GR. Access to document images over the Internet. Proc. IOLS'94. Medford NJ: Learned Information, 1994; 185-97.
For more information, contact
Communications Engineering Branch
Lister Hill National Center for Biomedical Communications
National Library of Medicine
8600 Rockville Pike, Bethesda, Maryland 20894
Phone: (301) 496-4496
Fax: (301) 402-0341
Internet: walker@ceb.nlm.nih.gov or
thoma@lhc.nlm.nih.gov
Katherine Willis - University of Michigan
Introduction
Taken in their ideal sense, digital libraries provide their users with access to the desired information regardless of its medium and without the constraints of time or place associated with conventional libraries. Moreover, digital libraries enable personalized information harvesting and management so that their users can develop their own ways of viewing and collecting the information. Achieving this vision of the digital library offers major challenges for today's publishers, librarians and technologists necessitating a variety of pilot projects through which components can be developed and experience can be gained. This article considers the driving forces, issues, and lessons learned by Carnegie Mellon University, the University of California System, the University of Michigan, the University of Tennessee, and Elsevier Science in implementing TULIP (The University Licensing Program), a research collaboration to build digital libraries.
Research University Expectations and Driving Forces
The fundamental mission and centuries-old tradition of libraries and the library profession has been to provide intellectual and physical access to and preservation of the human record. Increasingly, however, the high cost of paper-based resources, especially science and technology journals, and the burgeoning demands for physical space to contain exponentially expanding collections are denying libraries their ability to fulfill their mission. Furthermore, aware of the capabilities inherent in the digital form, librarians recognize opportunities for new services that will better support their communities of users. Finally, although scholars and students, traditionally reliant on a library as a provider of information, are not en masse demanding a different library form, many are aware of the potential digital information holds for them in improving ease of information access and management. The early adopters evident in research universities certainly understand the promise of the digital library and create a base for experimentation by their institutions' libraries.
If nothing else, most information is now created in digital form, bringing inevitably an end to the printed word's domination as its distribution medium. Indeed, given its digital characteristic, information is theoretically freed from constraints on its access through time restrictions on its use or geographic barriers demanded by its storage in a building. Improvements in document representation and structuring through mark-up languages like SGML mean that scholars and students will be able to search across disciplines and to review sources of all types with the likely outcome of whole new areas of study.
Since information is the fundamental raw material of the education process, its broadest availability better meets the requirements of the university. Not surprisingly, then, university research libraries have been driven to invest in digital library development by the inevitability of information in digital form, which enables enhanced service and flexibility while reducing the rigidity inherent in information in analog form.
Issues from a University Perspective
Management of intellectual property rights is perhaps the most frequently cited concern when the topic of digital libraries is raised. So many of the features that make the digital form desirable simultaneously threaten the controlled access found in printed materials. No longer are purchases by libraries or individuals the measure of an available copy or insurance of fee for use. In fact, practices such as interlibrary loan, long a valued service offered among libraries, are increasingly receiving pressure from publishers and others with intellectual property concerns as they seek ways to tighten control over distribution. Perhaps because of the very prominence of this issue in any discussion of digital libraries, its resolution through technology, law and social practice may occur sooner than that of some of the other issues.
Although few disagree that the intellectual property holder deserves fair reward, library and individual funding models will have to change fundamentally with the implementation of a digital library. Under current practices, libraries receive one-time funds to build physical structures and expand shelf space; purchase of information occurs through collection budgets. However, as libraries become vehicles of access rather than ownership, this method of funding requires substantial change. Similarly, the patron accustomed to "free" information through a library is likely to incur charges as he or she accesses the digital library or requests resources that the library no longer owns. The new financial models that must and will emerge as digital libraries become real offer fundamental challenges that will be difficult to resolve since they strike at both long-term practices and basic library traditions.
Equally problematic to the issues of finance is the readiness of the user for the digital library. In some sense, there is the possibility of preparing a party to which no one comes, especially if the cost of admission is too high. In the case of scholars who have established research methods, it is not clear whether they will adopt this new strategy for acquiring their information. Students seem likely candidates since most are technically literate and have a more single-minded focus than the faculty or science researcher. Given the evolutionary cycle for the digital library, it is reasonable to expect that the next generation of scholars and students will be the adopters and beneficiaries. Clearly, libraries are faced with a long transition from the current paper-based collection to digital collections and face substantial strain on human and financial resources.
Already, librarians, as information professionals serving particular communities, have seen changes to their traditional role of collection developer and preserver. They speak of "just in time" rather than "just in case" strategies. As information becomes increasingly available in digital form, however, librarians will be called upon for evaluation and quality assurance. Given the volume of information itself, users are likely to ask their information professionals to create personalized libraries for them, harvesting sources that meet particular needs in a variety of formats. Working as members of project teams, librarians may also become publishers, capturing and producing the official record of the group. The use of the professional skills of the librarian for sophisticated information access and management becomes possible by lessening, if not eliminating, the current demands for simplistic retrieval inherent in the limited searching capability of today's paper collections.
Maintaining or accessing a digital library assumes a level of technical infrastructure not required by users of current libraries. Although it is obvious that this includes adequate network bandwidth to enable a scholar or student to retrieve the digital resources, a suitable computer for their display, printers for output, and storage capability for the information provider, the difficulty in integrating all of those components and others into an easy to use system is less obvious except perhaps to those who, through pilots, have begun to tackle the problems. Indeed, each of the universities involved in TULIP underestimated this difficulty in different ways.
Publisher Expectations and Driving Forces
For the publisher, the notion of a digital library means bringing information to the users' desktops (or laptops) and providing tools to give users easy access to the information they need (and to filter out that which is not needed). Successful products will improve the efficiency of the customer, whether that customer is the library, with its purchasing, organizing and training roles, or the final consumer of the information. As information's value may increase with the sophistication of its software environment, it may make sense to talk less of "kinds of products" and more of a range of information tools and databases upon which users can act.
The most effective authors, readers, libraries and publishers in a digital world will be those who become most actively involved in the process of making information more accessible and relevant. The digital library's facility to interact with the information may change the model and the roles played. But this is a win-win game. Value added by one player (be it the author in flagging links or adding manipulatable data, the reader in using tools to increase the relevancy of his searches, the library in selecting and enabling desktop access or the publisher in structuring and indexing information for optimal retrieval) does not require a reduction in the value added or the role played by another player. As new and different skills are learned and incorporated, the overall quality of the information service improves.
Cost remains the binding constraint: cost of wide-area bandwidth, cost of storage, training, technical and editorial people, etc. There is far more technical capacity than money to pay for it. Progress is inevitably pushed by the enormous promise and potential of the digital library. It is a dream with which people connect on an emotional level: anything, anywhere, any time. Until or unless the brightness of the dream fades, that will drive us all forward.
Issues from a Publisher's Perspective
Although the tremendous potential of the digital library is recognized by all parties involved, and we can see that there is a large technical capacity available or developing to realize that potential, the actual implementation of the dream is still quite a few steps away. Cost is an important issue. In addition, we need to deal with questions about how to fulfill the promise of the digital library; that is, how do we make sure that we develop (types of) information tools and databases users actually need? In other words, what can we publishers, universities, network providers, etc. do to improve the way information is made accessible in an electronic environment?
To address some of these issues, Elsevier Science has joined forces with universities in the United States in the project TULIP.
TULIP
TULIP was started in 1991 as a result of discussions between university leaders at a number of schools with Elsevier Science (ES) to find a way to accelerate the development of large scale systems for the distribution in electronic form of traditional journal information--information presently found only in print. ES was looking at this question as a publisher and desired experience on which to make strategic developmental and investment decisions, whether in search software, document delivery systems, PostScript or SGML database files or network development. During a Coalition for Networked Information (CNI) meeting in Spring 1991, Karen Hunter of ES, and several university participants, outlined a project, and so TULIP was started.
The universities created proposals, and ES started to establish the technical and organizational framework necessary for such a large project. The TULIP program became operational in January 1993 with nine universities participating: Carnegie Mellon University, Cornell University, Georgia Institute of Technology, Massachusetts Institute of Technology, University of California (all campuses), University of Michigan, University of Tennessee, University of Washington and Virginia Polytechnic Institute and State University.
Objectives. TULIP is a cooperative research project testing systems for networked delivery of journals to, and their use at, the user's desktop. Specifically, the participants expected to:
- Technical. Determine the technical feasibility of networked distribution to and across institutions with varying levels of sophistication in their technical infrastructure. "Networked distribution" means sending the information both across the national Internet and over campus networks to the desktops of students and faculty. ES delivers the journal information to participating universities in standard formats. The universities incorporate the information in local prototype or operational systems. The participants also expected to compare a wide variety of delivery alternatives, search and retrieval systems and print-on-demand options.
- Organizational and economic. Understand, through the implementation of prototypes, alternative costing, pricing, subscription and market models that may be "viable" in electronic distribution scenarios; compare such models with existing print-then-distribute models; and understand the role of campus organizational units under such scenarios. The overall goal is to reduce the unit cost of information delivery and retrieval. "Viable" means economically and functionally acceptable to all parties.
- User behavior. Study reader usage patterns under different distribution (technical, organizational and economic) situations and consider improvement in the functionality of the information, whether as to article structure or retrieval tools. They planned to collect certain data uniformly at all sites for analysis in the aggregate and for comparison among different systems.
TULIP formats, distribution and implementations. ES is providing electronic files for 43 Elsevier journals in materials science and engineering. These files consist of: TIFF bit-mapped page images (cover-to-cover, scanned at 300 dpi, Group IV fax compression), edited and structured ASCII "heads" for each editorial item (including bibliographic citation and article abstract) and unedited, OCR-generated ASCII "full text" for use in searching, but not for display.
During the project, each university receives, without charge, the electronic full text (bit-mapped and ASCII) for those journals to which it subscribes in paper. Each also receives the bibliographic information for all 43 journals and has on-demand access on a pay-per-use basis to those titles to which it does not subscribe. With the exception of one institution, all the universities mount the full-text files of the journals to which they subscribe locally on their own file servers. The TULIP data files are not usable without hardware and software, which are not part of the experimental product; these are provided by the university. Each TULIP site implementation has different storage, retrieval, display and printing features, from high resolution images sent directly to desktop workstations to DocuTech print-on-demand of individual articles. This variety, an outgrowth of individual campus computing architectures, has provided TULIP participants with some comparative understanding for their future digital library development.
Timeline. Begun in early 1993, TULIP will continue through 1995. In the first part of the project both the universities and ES focused on technical issues; in the second part, they expect to concentrate on studying user behavior. During the project definition period, the nine research universities and Elsevier believed that they would have little difficulty in disseminating the selected materials science journals to the researcher's desktop; and, while they expected to test the technological capability of their campuses' infrastructures, they did not anticipate any substantial problem. They sought involvement in the project because it provided an opportunity to gain valuable experience in digital library building. It also offered a means through which they could gain knowledge of their scholars' uses of information. The experiences of four institutions are presented in this article.
University of California System
System Description. At the University of California System, the TULIP images are stored centrally and made available over the UC network to all nine campuses. The files have been linked to records in the Inspect1 and Current Contents2 databases in the University's MELVYL3 online public access system. Users perform normal MELVYL searches to retrieve records from those databases. If there is an image linked to a retrieved record, an informational message is displayed as part of the normal MELVYL system displays. There is also a facility to limit retrieved records to only those that have linked images. Having retrieved records in the MELVYL system that have linked images, users can issue a display image command to request that article images associated with a retrieved record be displayed on their output device. Currently, any output device capable of running the X Windows protocol is supported. The images are stored on an optical jukebox controlled by a UNIX workstation. This workstation also runs the software that projects the windows onto the end user's device. When a user requests that article images be displayed, the MELVYL system gathers the necessary information and communicates it to the image server. The image server receives from the MELVYL system the requested citation and other information necessary to identify the end user, and using a database resident on the UNIX workstation, maps the article to the corresponding bitmapped image files, which it fetches from the optical jukebox. It then opens windows on the user's workstation to display them. Facilities exist that allow the user to page back and forth in the article, adjust the size of the image, reverse the video, and print selected pages or the entire article. Currently, PostScript printers attached to print servers that support the TCP/IP LPR printing protocol are supported.
Once an article has been displayed, facilities exist in the image viewing software that allow browsing around the issue in which that article was found, without the need to issue another request on the MELVYL system.
There is an option to ask for the next or previous article, and to request browsing the table of contents for the issue. If that option is chosen, a table of contents window is shown to the user, who may then go directly to any article in the issue. There is also a large statistical component built into the image viewing software. This facility allows collection of both demographic data about the types of users who are using the system and extensive data about user behavior patterns that will help us gain more knowledge of how users access and make use of electronic information of this kind. This knowledge and understanding will help us better design future systems of this type. Currently, plans are underway to add fax support to the image viewing interface and to allow users without access to viewing workstations to request print or fax copies of material directly from the MELVYL system.
Implementation Lessons Learned. Although the information superhighway and digital libraries are conceptually very appealing, TULIP has demonstrated that implementing the systems to operationalize these concepts is quite difficult. This is due not to any lack of the needed technologies but rather to the infrastructure upon which the systems builder is dependent. Although TULIP is a limited prototype system with a relatively small number of journals and of interest to a small focused user community, it highlighted deficiencies within the University of California's infrastructure. These include:
- Storage. The UC images for TULIP are stored on an optical jukebox. While more economical than magnetic disk storage, the costs are still significant for such a large amount of data. There are also performance implications for using optical technology because it is still much slower than magnetic disk technology. This is exacerbated by the use of a jukebox since an optical platter must be fetched and mounted in response to a user's access request. Caching techniques using magnetic disks try to keep the most requested material more immediately available, but variations in access patterns can affect the efficacy of those algorithms. However, given the recent introduction of high speed, high capacity RAID magnetic storage devices, and what seems to be the rapid and constant advancement of magnetic disk technology, it is now becoming more practical to start using magnetic disks with their enhanced performance and response time capabilities for projects of this kind. It is likely that optical technology will continue to be used for less frequently accessed material since fetching it from an optical jukebox once and then caching it to magnetic media gives acceptable performance.
- Network Bandwidth. Displaying bitmapped images requires a significant amount of network bandwidth. While local area technologies are improving and providing ever increasing bandwidth to the desktop, making systems like TULIP more practical in local area and campus-wide environments, there is still a large discrepancy between these speeds and those available in the wide area network. Since the UC TULIP implementation is centrally located and images are projected onto users' workstations using our inter-campus wide area network, that discrepancy in speed has an impact on the performance of the system as seen by the end user. Increasing bandwidth in wide area networks and the use of compression will help address this problem. The current, widely deployed versions of the X Windows protocol do not contain support for moving compressed images. Although the newest version does add this support, it must be more widely deployed for the feature to have impact. Increases in network bandwidth will certainly help, although it may be hard to stay ahead of the ever increasing demand for it as new applications and data types become popular.
- The information provider, in this case Elsevier, is also very dependent upon bandwidth. One of the goals for TULIP was to test the viability of delivering material to the participants over the Internet using the File Transfer Protocol (FTP). For most of the TULIP participants who have chosen this method of receiving the material, this has resulted in the transfer of approximately 500 megabytes across the Internet every two weeks. Depending on network traffic, it takes between 3 and 7 hours to transfer the material to each site. Given that the TULIP project involves only 43 journals, it is unlikely that this method of distribution is scalable, even with projected enhancements to the Internet.
- User Computing Environment. Systems like TULIP require a well-networked user community. In preparing for TULIP, a survey was done of the installed equipment base among the material science and related departments on the UC campuses. We found that there was a wide divergence among the campuses in the types and amount of computer equipment in those departments and in their connectivity to the campus networks. This problem only increases when consideration is given to providing electronic information to a more heterogeneous and widely diversified user community. Although network bandwidth is a major factor in performance of systems like TULIP, the speed of the destination workstation CPU and the amount of memory available to it are also significant factors in the performance the end user experiences.
- Display Technology. The TULIP images are scanned at 300 dpi. The average computer display monitor has a resolution of 75-100 dpi. The difference between the resolution at which the images are scanned and the one at which they can be displayed online has an effect on their readability. Higher resolution monitors are available but are still quite expensive. In addition, the size of the display device and whether it can support such things as gray scaling have an impact on the usability of the images online. Larger size monitors that go beyond the 14-16 inch type found on most personal computers make the images much more readable but come at a steep cost beyond the affordability of most users. The quality of the display device is an important factor in how receptive the user will be to actually using online images.
- Printing. One of the obvious things users want to do with the TULIP images is print them out. This is actually a much harder problem than it sounds due to the lack of a common unified printing infrastructure inside the University. Printing is currently supported on any PostScript printer attached to the campus TCP/IP networks that can be accessed via the LPR printing protocol. Unfortunately, there are a large number of printers in users' departments that do not meet those requirements, and thus cannot be used for TULIP printing. These printers are attached to local area and departmental networks (or in many cases directly attached to a user's workstation) and cannot be accessed by an application like TULIP that is running remotely. Until infrastructure is put in place that provides better network connectivity for those locally connected printers, there will continue to be a population of user and departmental printers inaccessible to applications like TULIP. There is also the related issue of developing infrastructure and policies to determine what printers any user is allowed to access and to handle the appropriate charging and accounting. Such infrastructure ideally should be developed on a university-wide, rather than a campus-wide basis, to avoid separate and different solutions on each of the campuses.
- User Authentication. Currently, TULIP uses a simple password scheme for user identification and verification. It has become clear that such a simple scheme will not scale well, and that much more sophisticated mechanisms are required for user authentication. This is especially true as more and more services come into being that require distinguishing among different classes of users for such things as accounting and charge back (like printing). However, while several of the UC campuses are experimenting with user authentication systems (most notably with Kerberos), there is not yet any generalized widespread user authentication infrastructure in place inside the University of the type that will be required for systems like TULIP to operate in a true production mode.
- Training and Support. Systems like TULIP that utilize higher end technology such as X Windows devices may require more training and support than has been the case with more traditional online systems. In addition, because of the technology they use, they tend to require more complicated and customized configuration during their initial setup phases. They also may require that existing support and training staff acquire new skills. This will put further pressure on already strained and decreasing library and computer center budgets. Mechanisms such as good help systems need to be designed into such systems to reduce their impact on support and training staff.
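The FTP delivery figures quoted in the list above (roughly 500 megabytes per delivery, taking 3 to 7 hours) imply a low effective throughput. As a rough check, the following Python sketch converts those figures into an average rate. It assumes decimal megabytes (10^6 bytes) and ignores protocol overhead; the function name is illustrative and not part of any TULIP software.

```python
def effective_mbps(megabytes: float, hours: float) -> float:
    """Average throughput in megabits per second for one transfer."""
    bits = megabytes * 1_000_000 * 8   # assumption: 1 MB = 10^6 bytes
    seconds = hours * 3600
    return bits / seconds / 1_000_000


# Figures quoted in the text: about 500 MB in 3 to 7 hours.
fast = effective_mbps(500, 3)
slow = effective_mbps(500, 7)
print(f"{fast:.2f} Mbps best case, {slow:.2f} Mbps worst case")
# prints "0.37 Mbps best case, 0.16 Mbps worst case"
```

Under these assumptions the sustained rate is well under 1 percent of a nominal T3 line (about 45 Mbps), which is consistent with the concern that this method of distribution will not scale to a larger set of journals.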
All of these issues are manageable and even, to a large extent, acceptable within the context of a small research prototype project like TULIP. But many of them, especially issues involving data storage, network bandwidth, equipment infrastructure and printing, pose much larger problems that are compounded more quickly when envisioning expanding the scope of projects like TULIP to provide access to a wider range of material that will be more heavily accessed by a wider user community. None of the comments above are meant to imply any sense of failure about the TULIP project. On the contrary, we view the TULIP implementation at UC as being a great success. It has provided a service to our users, and it has allowed us to gain experience with the design issues and the technology necessary to produce other projects of a similar nature. And, perhaps most important, by raising the issues discussed above, it has given them more visibility in the University community in general and has begun to get people thinking about what infrastructures, mechanisms, and policies are needed so that other projects like TULIP can be created to provide more widespread, general access to electronic full text information to the university community as a whole. Finally, TULIP was our initial foray at the University of California into this whole area. Having developed a system for TULIP, we plan on using it in other projects that are coming online, building on the technology and, we hope, gaining from the lessons we have learned and our experience.
Carnegie Mellon University
System Description. Implementation of TULIP at Carnegie Mellon University has always been viewed as one part of the strategy in the transition to an electronic library. Important developmental targets are building a sizable body of useful, electronic full text data; widely distributing this data to campus; and without compromising user privacy, collecting information about user behavior in this electronic library environment. Working towards these targets in a continually evolving distributed architecture results in an environment that is both flexible and constraining.
Currently, TULIP is implemented using the software developed by Project Mercury for the Library Information System (LIS). LIS is a client/server system distributed through Andrew, the campus computing network. TULIP information is accessible from any computer that can connect to the campus network by anyone having a Carnegie Mellon user ID that is authenticated by Kerberos software developed at MIT.
Two access methods for retrieving the images are used: page images linked to the bibliographic records and a hierarchical browser. The bibliographic data for articles in the 43 TULIP journals is indexed using Newton, search and retrieval software developed by the Online Computer Library Center (OCLC). The bit-mapped images for the articles in the 29 TULIP journals for which Carnegie Mellon has a paper subscription are linked with this bibliographic index. The LIS software has two user interfaces: VT100, which runs on any machine that can emulate a terminal, and Motif, which requires a UNIX workstation running X Windows. The Materials Science bibliographic database can be searched using either interface, but a UNIX workstation is required to view the images linked to this data. The images are sent over the network in compressed form, then rapidly decompressed and displayed at the user's workstation. If a user wants to retrieve a known citation or journal issue, the browser may be used by selecting this option while using LIS. The page images of the retrieved article may be navigated by clicking on selected buttons to move back and forth between the pages.
Implementation Lessons Learned. TULIP implementation issues are either technical, managerial or a combination. Some of the issues that have surfaced are: the computing platform of the users, the functionality of the browser interface, the difficulty in logging data in a client/server environment, printing in a distributed network, coordination of technical work with publicity and promotion, and the time and energy required to develop and implement a production system.
- User Computing Environment. Additional interfaces need to be developed for the CMU Macintosh and Windows environments. The TULIP survey of the 172 users in the Materials Science Engineering (MSE) Department, the primary user group, showed that UNIX workstations are not the primary computing platform. UNIX workstations are available in the Engineering and Science Library (E & S), which is in close proximity to the departmental offices. However, to encourage the use of TULIP by providing more convenient access to appropriate equipment, a DECstation dedicated to LIS has been loaned to MSE for their computer cluster. Other potential TULIP users on campus may also have limited ability to access this system from their desktops.
- Interface. Focus group feedback and protocol analysis have uncovered interface design problems and the necessity for additional interface functionality, especially in the image browser. The document display window has some missing navigational features. Some inconsistencies in the paging mechanism, depending on whether access is through the bibliographic database or the hierarchical browser, need to be corrected.
- Charging. Providing networked printing with the associated billing mechanism is a complicated operation. There is no charge for viewing images, but to be consistent with current service practice for providing article prints, there is a modest fee for printing the articles. A "work-around" in place for prints involves sending electronic mail to an operator who prints the hard copy to be delivered in campus mail or picked up by the user in the E&S Library.
- Behavioral Studies. In order to develop a digital library that serves its users, it is necessary to gain insight into how the information is used and what changes may occur as a result of digital access. However, logging activity on the client workstation in a way that protects the privacy of users is more complex in a distributed environment.
- Promotion. Synchronizing technical work with publicity and promotion of TULIP is critical to generate and sustain interest and participation in the project. In addition to factors already mentioned, the bibliographic and image data must be current. Difficulty in receiving data via FTP together with completing the indexing for the most current data seems to be another factor affecting the usage. A major promotion of TULIP has been postponed until software for accessing image and bibliographic databases has been thoroughly tested and the databases themselves are current.
The time and energy required to implement a production system always exceed desires and expectations. Unforeseen problems are inevitable in development projects. At Carnegie Mellon, staff working on TULIP must also work on other projects designed to leverage all work done on the electronic library implementation. Any slip in completion dates in related projects may result in delays in the TULIP project. Staff changes can result in additional project delays as new staff get "up to speed." In spite of this, progress on TULIP is encouraging as we learn from experiences in developing the electronic library.
The University of Michigan
System Design. In order to ensure that all of its users have access to the journals, the University of Michigan designed its TULIP implementation to integrate the citation and abstract information into the University Library's Management System through which a print copy is also available. It also offers researchers access to the full text displayed on their workstation as well as the choice of any printer attached to the campus network through the University of Michigan Digital Library project. A student or researcher can use the system either through a terminal emulator which connects to the library management system, specifically the NOTIS LMS/MDAS software, or a Mosaic client that attaches to a server to perform searches and retrieve journal articles. A third option, TULIPView, was developed before the availability of Mosaic and supports users who prefer an X-Windows implementation.
Since the TULIP data sets are large, typically 500 megabytes each, a hierarchical approach to storage is used. The server, a Sun Microsystems SPARCstation 20 with 45 gigabytes of storage, holds all current journals and any articles that have been used recently or frequently. Older issues are written to CD-ROMs, joining the 1992 back issues that Elsevier provided at the start of the project. A Kubik 240-disc carousel is used for CD-ROM access.
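The placement rule described above can be sketched in Python as follows. This is only an illustration of a two-tier policy, not the code used at Michigan: the thresholds for "recently" and "frequently" are hypothetical, since the source does not state them, and "current" is approximated here as the current calendar year.

```python
from dataclasses import dataclass
from datetime import date, timedelta
from typing import Optional

# Hypothetical thresholds; the actual cutoffs are not given in the source.
RECENT_WINDOW = timedelta(days=90)
FREQUENT_USES = 10


@dataclass
class Issue:
    journal: str
    published: date
    last_used: Optional[date]
    use_count: int


def choose_tier(issue: Issue, today: date) -> str:
    """Return 'disk' for current, recently used, or frequently used issues, else 'cdrom'."""
    is_current = issue.published.year == today.year   # assumption: current = this year
    recently_used = (
        issue.last_used is not None and today - issue.last_used <= RECENT_WINDOW
    )
    frequently_used = issue.use_count >= FREQUENT_USES
    if is_current or recently_used or frequently_used:
        return "disk"
    return "cdrom"


today = date(1995, 3, 1)
idle_back_issue = Issue("Journal A", date(1992, 6, 1), None, 0)
popular_back_issue = Issue("Journal B", date(1993, 1, 1), date(1994, 1, 1), 25)
print(choose_tier(idle_back_issue, today))     # cdrom
print(choose_tier(popular_back_issue, today))  # disk
```

Keeping the rule in one function makes it easy to rerun over the whole collection whenever a new data set arrives and disk space has to be reclaimed.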
Michigan's World Wide Web TULIP implementation combines the features of a browsing system and a searching system. Using a tree structure, users move from a list of the journals down through available volumes and issues to a "table of contents" containing author and title information for a particular issue. Selecting an article then leads the user to a page containing the full abstract/citation information for that article. This also contains links which the user can follow to view the full page images of the article. These images are in either gray-scale or monochrome and either 75 dpi or 100 dpi depending on the capabilities of the local computer.
At any level of this hierarchy, down to and including the individual issue, the user can choose to perform a search on anything lower on the browser tree. For example, the user can search for a particular topic in a particular issue of a journal. The search engine used is the Full Text Lexicographer system, designed at the University of Michigan. It allows for fielded and full text searching of both the abstract/citation information and the OCR'ed full text of the page images.
As an alternative to Mosaic, the TULIPView user interface runs on UNIX workstations supporting the X-Windows windowing system. Using TULIPView, researchers may use either a simplified menu approach or Boolean and proximity operators to construct their searches interactively. They may also use a profiling or notification function. A successful search query can be stored as a profile. Whenever a new TULIP data set is received, these stored profiles are automatically run against all of the new articles. The abstracts of articles that match the profile are formatted and sent as an electronic mail message complete with a table of contents.
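The notification function described above can be sketched as follows. This is a simplified Python illustration, not the TULIPView implementation: a profile is reduced to a list of terms that must all occur in the title or abstract, standing in for the Boolean and proximity queries of the real search engine. The class names, addresses and sample articles are invented, and each owner is assumed to have a single profile.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Article:
    title: str
    abstract: str


@dataclass
class Profile:
    owner: str          # placeholder e-mail address of the researcher
    terms: List[str]    # every term must occur; a stand-in for a real query


def matches(profile: Profile, article: Article) -> bool:
    text = f"{article.title} {article.abstract}".lower()
    return all(term.lower() in text for term in profile.terms)


def build_notifications(profiles: List[Profile], new_articles: List[Article]) -> Dict[str, str]:
    """Map each profile owner to a message body covering the matching articles."""
    messages = {}
    for profile in profiles:
        hits = [a for a in new_articles if matches(profile, a)]
        if not hits:
            continue  # nothing to send for this profile
        contents = "\n".join(f"{i}. {a.title}" for i, a in enumerate(hits, 1))
        abstracts = "\n\n".join(f"{a.title}\n{a.abstract}" for a in hits)
        messages[profile.owner] = f"Table of contents\n{contents}\n\n{abstracts}"
    return messages


# Invented sample data for illustration only.
new_articles = [
    Article("Creep in ceramic composites", "We study creep behavior of ceramic matrix composites."),
    Article("Thin film deposition", "A survey of sputtering methods for thin films."),
]
profiles = [
    Profile("a@example.edu", ["ceramic", "creep"]),
    Profile("b@example.edu", ["superconductor"]),
]
messages = build_notifications(profiles, new_articles)
print(messages["a@example.edu"])
```

Handing each message body to a mail system is left out; the point of the sketch is that profiles are evaluated only against the newly received articles, so the cost of notification grows with the size of each data set rather than with the whole collection.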
For the system to work, several programs must be executed by the server. As new information arrives, it is indexed using Full Text Lexicographer. A server program built around this search engine accepts connections from TULIPView clients, performs searches and retrieves documents. A notification program runs stored profiles against all new articles. A print server program accepts print requests, converts article images to printable format, and sends them off to any of several hundred laser printers around campus. The TULIP server also runs a package called CAP (Columbia AppleTalk Package), which allows it to print to AppleTalk printers in addition to the UNIX printers. As a result, faculty or students can search and view the articles on their desktop workstations, and then print those they select on the closest or most convenient printer.
Those faculty and students who choose to access TULIP through the library management system can take advantage of its integration with other holdings, including information about the location and availability of the journals' paper copies. In this case, they are able to search by author, title or keyword; build a table of contents "on the fly" by searching for a combination of journal title and date; and request that the article be printed at any of eight printers identified on a selection menu. They cannot display the article on their workstation.
Implementation Lessons Learned. Since students and faculty have long experience with traditional libraries, they hold strong expectations of the service they should receive from a digital library. These include access to a comprehensive collection containing past and very current sources. If they are going to change their research strategies to take advantage of the capabilities offered by information in digital form, then they must feel assured that the digital library offers a value so compelling that the change is worthwhile.
- Storage. As a result of its TULIP experience, the University of Michigan learned that it must exercise a hierarchical storage strategy and ensure within that strategy adequate technology to maintain current contents rapidly accessible. At a broader level, TULIP raises the question of whether regional strategies are desirable to avoid the redundancy in holdings currently overwhelming the traditional library through cost and physical mass.
- Collection. Based on a survey of the scientists involved in the project, journal literature was identified as a primary source. However, without expanding the digital holdings so that researchers avoid searching in multiple locations, both digital and physical, TULIP will remain at the periphery of their use. Fortunately, Elsevier has recognized this problem as well and has increased the availability of journals. Nevertheless, the University and its peers need to work with publishers and professional societies to expand the availability of content through the digital library.
- Charging. The University has not charged real dollars for use of digital resources in the past, with the exception of extensive searching of commercial databases like Dialog. As the digital library begins to offer commercial resources for which licensing or pay-for-use arrangements require payment, electronic billing is required. Initially, we expected to attach TULIP to a campus billing system adopted to satisfy requirements of Michigan's distributed computing environment. However, the attractiveness of the Mosaic interface for the digital library caused us to abandon that strategy and to depend on the future implementation of an Internet billing system developed by a third party. Unfortunately, this has delayed the development of policies and budgeting methods that will certainly be required when such a billing system is introduced.
- Researcher use patterns. Electronic logs indicate that, by a ratio of 3:1, those who move from an abstract to the article's full text read the screen output rather than print the article. Since most have desktop printers, convenience and speed are not factors. More likely, they recognize that the article is available on demand for later in-depth study and use an initial screen reading to note those of value for future reading and citation. This, however, is only a hypothesis, and much more analysis of user behavior is necessary.
University of Tennessee at Knoxville
In implementing its TULIP project, the University of Tennessee at Knoxville (UTK) decided to leverage its strong relationship with Martin Marietta Energy Systems (MMES) at the Oak Ridge National Laboratory. This enabled it to test the technical, behavioral and business implications of two different types of organizations using the Elsevier Collection. It also allowed access to the aggregate journals of the two institutions' libraries and provided UTK faculty holding joint appointments the opportunity to utilize the information resources without geographic constraint.
System Description. Having inventoried the computing environment of the UTK and Marietta materials scientists, the system developers determined that they would provide tables of contents and abstracts only, since the researchers' workstations were primarily personal computers and server storage capacity was not available to manage complete image files. The UTK Library acquired a standard IBM ValuePoint 486 PC equipped with 20 megabytes of RAM, a gigabyte of storage, a high-end graphics card, a CD-ROM drive, and a 17-inch monitor. Linux, a freely available UNIX-like system, is used as its operating system, and a high capacity laser printer enables printing of the image files.
Data sets sent via the Internet from Engineering Information (Ei) are mounted locally and are parsed using a program developed at UTK by a student programmer. They are then moved into the University's Online Library Information System (OLIS) where they are available to the libraries' users. Access to the table of contents is available to any researcher by dialing into the UTK gopher, but access to the abstracts is limited to UTK and MMES users in accordance with Elsevier's project requirements.
In response to the Marietta scientists' requirements for keyword searching, WAIS indexing was added. This greatly enhanced the use of TULIP. In May of 1993, the TULIP data sets were integrated into OLIS and the researchers, by completing a profile sheet, became eligible to receive full text articles without charge during the test period. The profile asks for demographic information, computing capabilities, information-seeking habits, and preferences for TULIP article delivery.
As registered users browse the table of contents or abstracts via OLIS to identify articles of interest, they note the unique item identifier that Ei assigns to each article. Requests are sent to the UTK/MMES TULIP electronic mail address. A student checks the account daily and sends any requests to Ei via electronic mail; Ei sends the page images via FTP; the student prints them on a laser printer and then sends the copies by fax or courier to the requester. The student can also forward the article to the user's workstation if the user has appropriate equipment and software. Removal of the library as an intermediary is a design goal.
The gopher server log analyzer keeps a record of usage data on the table of contents and abstracts files making it possible to determine if users are in the library, on campus, at MMES, or off-campus. Analysis of which journal abstracts are being read is also possible. The number of browsers is high, especially when considering the number of registrants. For example, from the gopher log we found the number of callers browsing the abstracts ranged from about 20 in October 1993 to 90 in January of 1994 and 110 in February of 1994. Use of the current paper copies in the test group was monitored for six months by the UTK Periodicals staff before TULIP was available on the gopher. They have continued to keep the statistics so that we may measure the impact of the electronic tables of contents against use of the paper issues before and during the testing period.
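The kind of location breakdown described above can be sketched in Python as follows. This is an illustration only, not the UTK log analyzer: it assumes that each log record yields a client host name and that location can be inferred from the host's domain suffix. The suffixes and sample hosts are placeholders, since the source does not list real ones.

```python
from collections import Counter

# Hypothetical suffix rules, checked in order (most specific first).
RULES = [
    ("library", ".lib.example-univ.edu"),
    ("campus", ".example-univ.edu"),
    ("MMES", ".example-lab.gov"),
]


def classify(host: str) -> str:
    """Return the location label for a client host name."""
    host = host.lower()
    for label, suffix in RULES:
        if host.endswith(suffix):
            return label
    return "off-campus"


def tally(hosts) -> Counter:
    """Count accesses per location label."""
    return Counter(classify(h) for h in hosts)


# Invented sample of client hosts taken from log records.
sample_hosts = [
    "pc1.lib.example-univ.edu",
    "ws4.chem.example-univ.edu",
    "node.example-lab.gov",
    "dialup7.example.net",
    "pc2.lib.example-univ.edu",
]
counts = tally(sample_hosts)
print(dict(counts))
```

The library rule has to come before the general campus rule, because a library host also ends with the campus suffix.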
Implementation Lessons Learned. Designing and implementing a digital library requires a complex set of skills, resources and institutional commitment. Its implementation can range from a simple design to a very complex, full-functioned system. However, regardless of where a project falls in this range, a digital library cannot be developed at the margin. Time, money, and technical expertise are definitely needed.
- Political Considerations. Building an effective team representing different areas of expertise is essential. Librarians, staff and students from Networked Services, Systems, Reference, and Periodicals at UTK, librarians from MMES, and a professor from the UTK School of Information Science are all part of the TULIP team. Staff from the UTK Computing Center have also been consulted.
- Budget. Financial support is certainly needed, depending on available technical resources and the commitment that the institution wants to devote to electronic resources. Clearly, storage, print and display technology govern the system's design.
- User Behavior. Although users are browsing the files, use of the article service is negligible, as is use of the journals in the test group. This is surprising because the Materials Science faculty and students expressed an interest in all of the journals. However, it may be that their actual research lies elsewhere. It is also likely that the steps now required to obtain the images discourage use of the files.
Although several issues require resolution, UTK's participation in TULIP has enabled it to begin to develop understanding of the requirements for building a digital library. We have developed a base from which insight into issues such as archiving and storage of electronic resources, copyright questions, and article pricing, can be obtained.
Elsevier Science
Implementation Lessons Learned. One thing we have learned is that cost remains a constraint. Building the digital library requires a team of people with sufficient funding to make it happen. Not all libraries can implement things locally and not all remote applications are functionally viable now. On the positive side, then, we can see that a number of universities not only have the kind of technical sophistication needed in-house, but also can provide the dedicated teamwork and strong leadership it takes to make things happen, to actually build tools that bring information to the user's desktop.
As a publisher, we have a strong commitment to make information available in an electronic format, forming the foundation of the digital library. In TULIP, we have seen that tools can be developed to bring that information to the end users. In order to be successful, the digital library will have to provide the information users need, in the form they need it, when they need it, where they need it. Therefore, the next step, which we are undertaking now, is to investigate which tools and which information users really want.
Conclusions
Although focused on the electronic distribution of journals, TULIP offers valuable experience for those desiring to create digital libraries. Analysis of the project suggests:
(1) Currently implemented information architectures are unlikely to scale. TULIP consisted of only 43 journals, yet Elsevier and all of the institutions encountered problems with available infrastructures. As the central distributor, through its agent Ei, Elsevier discovered that the Internet is not robust enough either in speed or in reliability of data transfer. All of the institutions, including single campuses with T3 connectivity, experienced difficulty with the reliability and accuracy of the Internet service. As the local distribution agents, the universities underestimated the technology requirements that will enable the smoothly integrated, holistic service that a traditional library offers its clients. In some cases, this was manifested in inadequate storage resulting in lack of data set currency; in others, there were gaps in the ease of the systems' use and in their performance. Display, either on screens or through printed output, created yet another issue as the builders constructed their digital libraries by melding them with their institutions' existing information technology architecture and infrastructure.
(2) Even if it were feasible to develop a digital library without consideration of any installed technology, cost would constrain progress. The transition from the print-dominant traditional library to the multimedia digital library will be costly as new technology is demanded, users and information professionals are trained, and publishers develop new forms of information tools and data sets for their customers. These costs will have to be paid, resulting in new pricing policies and mechanisms by the publisher seeking return on investment, new budgeting practices by the institution that fit with information access rather than ownership, and new use patterns by researchers for whom information entitlement is unlikely to remain a governing principle.
(3) Ultimately, to justify and sustain the transition to the digital library, all of the participants (publishers, institutions, and scholars) must gain. It is likely that sustained investments by institutions in digital libraries will occur only if the institution finds genuine productivity benefits for its researchers; and researchers are unlikely to change their approach to information retrieval and use until they perceive a compelling reason, again most likely in productivity. TULIP helped to crystallize the vision of the digital library for materials scientists so that they could begin to see the promise of anywhere, any time access, but it lacked the comprehensiveness to which they are accustomed in the traditional library. Consequently, it is questionable whether it gained their consistent use or changed their current research strategies. Since no single institution or publisher can economically create or sustain this required comprehensive digital library, collaborations and consortiums such as TULIP are essential.
As a beginning step, TULIP demonstrated that research universities and a commercial publisher can effectively work together to gain understanding of the implications of digital library development. What is needed now is an expansion of projects like TULIP which build on previous experience and stimulate all of the players in the information chain to step outside their roles to rethink what they do and how they do it.