Wednesday, February 4, 2009

Digitization: Open Source vs. M$, and Time for Servers

Many people assume that digitization of an archive or records system means scanning things into a computer. If digitization were as simple as someone using the flatbed Epson to scan in the photographic archives of the Texas Tech football team playing their way toward the 1939 Cotton Bowl, then most digitization specialists would be grossly overpaid.

Digitization processes, at a minimum, require three components to be successful. For this example, I will continue with my photograph scenario. First, the process of scanning must follow a set of accepted standards so that digital files remain consistent across years and changing technologies. In other words, these standards provide for digital preservation with the primary intent of accessibility.

Second, a metadata standard must be agreed upon by all project managers. DSpace, the program that I am testing at the Southwest Collection, uses the Dublin Core metadata standard by default. Dublin Core can be swapped out for another standard, but moving away from the default means a fair bit of extra work on the server backend. A metadata librarian chooses how to represent a digitized item within that system, and this representation, whether in Dublin Core or something else, gives the librarian a lot of control over how the item will be searched, which digital collections it will belong to, and so on.
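To make that concrete, here is a sketch of what a bare-bones Dublin Core record for our football photograph might look like. The field values are invented for illustration, and I have left out namespace declarations and the qualified refinements that DSpace layers on top of these basic elements:

    <record>
      <dc:title>Texas Tech football team en route to the 1939 Cotton Bowl</dc:title>
      <dc:creator>Photographer unknown</dc:creator>
      <dc:date>1938</dc:date>
      <dc:subject>Football -- Texas -- Lubbock</dc:subject>
      <dc:description>Black-and-white photograph; caption taken from the verso.</dc:description>
      <dc:format>image/tiff</dc:format>
      <dc:rights>Southwest Collection, Texas Tech University</dc:rights>
    </record>

Nearly every choice a metadata librarian makes shows up in a record like this: which subject headings get used, whether the verso caption becomes the description, who is credited, and under what rights.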

The third component, and perhaps the most interesting in a digital project, is good editorial judgment. A digitization project manager, often a faculty member in the archive here at the Southwest Collection, must consider how best to represent a photograph. Do well-trained students applying markup determine how the photograph is searched? Does the faculty member restrict the keywords student assistants may use to describe an item? If the back of a photograph carries any information, should it be scanned as well? These judgments must answer both to standards and to editorial representation. If a university archive is providing a digitized version for research access, it must consider what best preserves the information as it appears in the physical photograph. If a museum is presenting the same digitized photograph, it will have to use different metadata standards to represent the photo as part of a wider collection, for instance in an electronic exhibit format.

Where do DSpace and open source operating systems fit into these questions? Let me start with the operating systems: M$ provides easy-to-use, quick-and-dirty server products that rapidly integrate information onto the Internet, onto network shares, into databases, and so forth. The problem, however, is that much of the M$ software is proprietary, expensive, and bulky. The M$ operating systems come prepackaged with hundreds of unnecessary services that really could be disabled, provided the administrator knows which ones performance can do without. Additionally, open source software is often difficult to run on M$ because of compatibility issues; it is usually much easier to run on open source servers.

Open source operating systems provide a free and quick means of manipulating and serving data, as long as the administrator knows what s/he is doing with the software. A lot of people run screaming when they hear “Unix” or “Linux,” and most don’t even know what “FreeBSD” means. Admins interested in Unix or Linux should realize that running an open source operating system gives them an enormous amount of control over what actually runs on that server.
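On my Gentoo test server, for instance, that control is concrete: the init system will show you exactly what starts at boot, and one command trims it. (The service name below is just an example; substitute whatever you find you don’t need.)

    rc-update show               # list every service assigned to each runlevel
    rc-update del cupsd default  # drop a service you don't need from the default runlevel
    /etc/init.d/cupsd stop       # and stop it now rather than at the next reboot

Try finding that kind of one-line accountability in a prepackaged M$ server install.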

An open source server administrator who knows what s/he is doing can format, rebuild, and reinstall an operating system in approximately three hours, including reinstalling all necessary software packages, given the opportunity for unwavering attention to the matter at hand. When an open source content management system like Plone crashes on a server administrator, all that is really necessary is a single service restart: /zope/bin/zopectl restart, for instance, without having to rely on five other dependencies to perform this command. Another example is how DSpace runs. If a service interruption occurs with DSpace, for instance on my Gentoo test server, all I have to do is /etc/init.d/tomcat-6 restart and /etc/init.d/postgresql restart, and that is that. The most important thing on the server end of running a digitization server, in fact, is watching where your database is stored: keep DSpace on a different disk from your database. (Spindle placement matters, too, in building a larger digitization server. But more on that later.)
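As a script, that recovery routine is short enough to keep in your head. This is a sketch of my own sequence; the catalina.out path is an assumption about where your Tomcat install writes its logs, so adjust it to your layout:

    #!/bin/sh
    # Bounce the DSpace stack on a Gentoo/OpenRC box:
    # the database first, then the servlet container hosting DSpace.
    /etc/init.d/postgresql restart
    /etc/init.d/tomcat-6 restart
    # Tail the container log to confirm the webapp redeployed cleanly.
    tail -n 20 /var/log/tomcat-6/catalina.out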

As I test DSpace, I need to learn more about PostgreSQL, which brings up my next point. A serious danger in IT is allowing oneself a very wide breadth of knowledge without a much-needed depth. The job itself demands adaptability to new software, new hardware, and even new user learning styles; unfortunately, the greatest limitation on any server administrator is time. Between constant interruption and poor funding, new software and hardware that should go into production as soon as it is purchased or built often sits in a corner, rotting into obsolescence, simply because the administrator has had no time to build, test, and secure the new product, let alone train users on it. This, at least, has been my observation of friends in academic IT, and it has certainly been my own experience. Unless one can focus without interruption, depth of knowledge gets traded away for flexibility of learning.
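My PostgreSQL homework starts small. A session like the one below is enough to start seeing how the repository actually stores its data; the database and role names here assume the DSpace defaults, so check your dspace.cfg if yours differ:

    psql -U dspace -d dspace    # connect to the repository database
    # then, at the psql prompt:
    \dt                         -- list the tables DSpace created
    \d+ tablename               -- describe any table that looks interesting
    \q                          -- quit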

Test servers offer admins the opportunity to understand the nuts and bolts of what goes on. Also, in testing with Linux or Unix (like FreeBSD), I highly recommend avoiding the GUI entirely. If you are going to learn the operating system, learn the command line first. Otherwise, you’re shortchanging yourself on a good opportunity to know exactly how things function.
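A handful of everyday commands will teach you more about a box than any control panel. These are generic illustrations; the netstat flags shown are the Linux ones, and FreeBSD’s equivalents differ:

    ps aux                      # what is actually running right now
    df -h                       # disk usage, human-readable
    netstat -tlnp               # what is listening on which ports (Linux flags)
    tail -f /var/log/messages   # watch the system log as events happen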
