Digitizing Books

23 October 2007 by The General. Published in: Civil War books and authors

In August, I posted about the better features of Google’s book search. J. D. and I both made extensive use of Google’s site, as well as the Microsoft Live Book Search site. (A note about the Microsoft site: it will not work on the Mac. That, in and of itself, is reason enough for me not to want to use it at all.)

There is a third organization digitizing books: the Internet Archive, which is scanning public domain books and making them available. The following article appeared in today’s issue of The New York Times, and it explains why I prefer the Internet Archive project:

Libraries Shun Deals to Place Books on Web
By KATIE HAFNER
Published: October 22, 2007
Several major research libraries have rebuffed offers from Google and Microsoft to scan their books into computer databases, saying they are put off by restrictions these companies want to place on the new digital collections.

The research libraries, including a large consortium in the Boston area, are instead signing on with the Open Content Alliance, a nonprofit effort aimed at making their materials broadly available.

Libraries that agree to work with Google must agree to a set of terms, which include making the material unavailable to other commercial search services. Microsoft places a similar restriction on the books it converts to electronic form. The Open Content Alliance, by contrast, is making the material available to any search service.

Google pays to scan the books and does not directly profit from the resulting Web pages, although the books make its search engine more useful and more valuable. The libraries can have their books scanned again by another company or organization for dissemination more broadly.

It costs the Open Content Alliance as much as $30 to scan each book, a cost shared by the group’s members and benefactors, so there are obvious financial benefits to libraries of Google’s wide-ranging offer, started in 2004.

Many prominent libraries have accepted Google’s offer — including the New York Public Library and libraries at the University of Michigan, Harvard, Stanford and Oxford. Google expects to scan 15 million books from those collections over the next decade.

But the resistance from some libraries, like the Boston Public Library and the Smithsonian Institution, suggests that many in the academic and nonprofit world are intent on pursuing a vision of the Web as a global repository of knowledge that is free of business interests or restrictions.

Even though Google’s program could make millions of books available to hundreds of millions of Internet users for the first time, some libraries and researchers worry that if any one company comes to dominate the digital conversion of these works, it could exploit that dominance for commercial gain.

“There are two opposed pathways being mapped out,” said Paul Duguid, an adjunct professor at the School of Information at the University of California, Berkeley. “One is shaped by commercial concerns, the other by a commitment to openness, and which one will win is not clear.”

Last month, the Boston Library Consortium of 19 research and academic libraries in New England that includes the University of Connecticut and the University of Massachusetts, said it would work with the Open Content Alliance to begin digitizing the books among the libraries’ 34 million volumes whose copyright had expired.

“We understand the commercial value of what Google is doing, but we want to be able to distribute materials in a way where everyone benefits from it,” said Bernard A. Margolis, president of the Boston Public Library, which has in its collection roughly 3,700 volumes from the personal library of John Adams.

Mr. Margolis said his library had spoken with both Google and Microsoft, and had not shut the door entirely on the idea of working with them. And several libraries are working with both Google and the Open Content Alliance.

Adam Smith, project management director of Google Book Search, noted that the company’s deals with libraries were not exclusive. “We’re excited that the O.C.A. has signed more libraries, and we hope they sign many more,” Mr. Smith said.

“The powerful motivation is that we’re bringing more offline information online,” he said. “As a commercial company, we have the resources to do this, and we’re doing it in a way that benefits users, publishers, authors and libraries. And it benefits us because we provide an improved user experience, which then means users will come back to Google.”

The Library of Congress has a pilot program with Google to digitize some books. But in January, it announced a project with a more inclusive approach. With $2 million from the Alfred P. Sloan Foundation, the library’s first mass digitization effort will make 136,000 books accessible to any search engine through the Open Content Alliance. The library declined to comment on its future digitization plans.

The Open Content Alliance is the brainchild of Brewster Kahle, the founder and director of the Internet Archive, which was created in 1996 with the aim of preserving copies of Web sites and other material. The group includes more than 80 libraries and research institutions, including the Smithsonian Institution.

Although Google is making public-domain books readily available to individuals who wish to download them, Mr. Kahle and others worry about the possible implications of having one company store and distribute so much public-domain content.

“Scanning the great libraries is a wonderful idea, but if only one corporation controls access to this digital collection, we’ll have handed too much control to a private entity,” Mr. Kahle said.

The Open Content Alliance, he said, “is fundamentally different, coming from a community project to build joint collections that can be used by everyone in different ways.”

Mr. Kahle’s group focuses on out-of-copyright books, mostly those published in 1922 or earlier. Google scans copyrighted works as well, but it does not allow users to read the full text of those books online, and it allows publishers to opt out of the program.

Microsoft joined the Open Content Alliance at its start in 2005, as did Yahoo, which also has a book search project. Google also spoke with Mr. Kahle about joining the group, but they did not reach an agreement.

A year after joining, Microsoft added a restriction that prohibits a book it has digitized from being included in commercial search engines other than Microsoft’s.

“Unlike Google, there are no restrictions on the distribution of these copies for academic purposes across institutions,” said Jay Girotto, group program manager for Live Book Search from Microsoft. Institutions working with Microsoft, he said, include the University of California and the New York Public Library.

Some in the research field view the issue as a matter of principle.

Doron Weber, a program director at the Sloan Foundation, which has made several grants to libraries for digital conversion of books, said that several institutions approached by Google have spoken to his organization about their reservations. “Many are hedging their bets,” he said, “taking Google money for now while realizing this is, at best, a short-term bridge to a truly open universal library of the future.”

The University of Michigan, a Google partner since 2004, does not seem to share this view. “We have not felt particularly restricted by our agreement with Google,” said Jack Bernard, a lawyer at the university.

The University of California, which started scanning books with the Open Content Alliance, Microsoft and Yahoo in 2005, has added Google. Robin Chandler, director of data acquisitions at the University of California’s digital library project, said working with everyone helps increase the volume of the scanning.

Some have found Google to be inflexible in its terms. Tom Garnett, director of the Biodiversity Heritage Library, a group of 10 prominent natural history and botanical libraries that have agreed to digitize their collections, said he had had discussions with various people at both Google and Microsoft.

“Google had a very restrictive agreement, and in all our discussions they were unwilling to yield,” he said. Among the terms was a requirement that libraries put their own technology in place to block commercial search services other than Google, he said.

Libraries that sign with the Open Content Alliance are obligated to pay the cost of scanning the books. Several have received grants from organizations like the Sloan Foundation.

The Boston Library Consortium’s project is self-funded, with $845,000 for the next two years. The consortium pays 10 cents a page to the Internet Archive, which has installed 10 scanners at the Boston Public Library. Other members include the Massachusetts Institute of Technology and Brown University.

The scans are stored at the Internet Archive in San Francisco and are available through its Web site. Search companies including Google are free to point users to the material.

On Wednesday the Internet Archive announced, together with the Boston Public Library and the library of the Marine Biological Laboratory and Woods Hole Oceanographic Institution, that it would start scanning out-of-print but in-copyright works to be distributed through a digital interlibrary loan system.

The Internet Archive’s system is far superior to that of either Google or Microsoft because it’s the only truly open one available to the world. That’s why I prefer to use its resources whenever possible.


Comments

David Woodbury

Tue 23rd Oct 2007 at 10:55 pm
Just for the record, the Microsoft site works perfectly well with the latest version of Firefox (and maybe earlier ones), and OSX 10.4.10. I just browsed through a couple books there on my beautiful new iMac.

Looks like Safari has a problem with that site.

David

Valerie Protopapas

Wed 24th Oct 2007 at 2:10 pm
Just about to comment that I have used the site on my Mac. However, despite all attempts to ‘refine’ my search, the site will not let you look at any more than 250 ‘hits’ despite reporting over 2,000 possible responses. Every effort that I’ve made has been thwarted and finally I was told that the site would not let ANYONE accessing same to see past 250 hits. They said that this limitation was placed upon the user because people ‘wouldn’t want’ to search more than 250 hits – which, of course, is hogwash! What’s the sense of turning up 2,000+ hits if you can’t access them?

Furthermore, when I asked if it would be possible to somehow eliminate the first 250 that show up (and keep showing up) – a sizable percentage of which are not pertinent to my search – and then see other ‘hits’ that come up after the first 250, I was told no. Again, I fail to see of what use that is if the site CAN give you over 2000 possible responses to keywords but limits what you can see to the first 250 – the SAME first 250 – every time! It’s like being told that something is ‘out there’, but we’re not going to let you know what it is! MOST annoying!
