Poster: stbalbach
Date: September 20, 2011 03:18:25pm
Forum: texts
Subject: Re: Inconsistent author names
The problem you mention is more than just cosmetic. When building targeted searches, one has to be aware of the many ways an author's name might appear in the database. This requires complex search strings, so complex in fact that the string can exceed 1024 characters (or whatever the max is), making it impossible to search for all books by an individual author (see below for an example). I've asked info@archive.org a number of times if they could simply increase the max search string length but never received a reply. It seems to me a rather trivial thing to do: just increase the variable size and recompile (assuming there's not already a cfg file for it).
--
With that said, it helps to understand some things about IA in order to be more forgiving of its limitations.
1. It is not a company but a non-profit, with limited staff and resources. It runs triage with those resources, i.e. there is more work than resources available. I don't always agree with the resulting priorities, but in the end they add a lot of new books every day! That's the most important thing.
2. The metadata is entered by various entities. Maybe one book was entered by Microsoft, another by the University of Michigan, another by John Smith, a user who uploaded it in his spare time, another by the Federal Govt. Maybe the data was imported from an old database, maybe it was created fresh just for IA, maybe it was done 10 years ago. There is no single person or entity responsible for making sure the data is consistent.
3. There is no direct way for end-users to modify the metadata at archive.org (other than notes in the review field). But there is openlibrary.org, a sort of Wikipedia-like interface to Internet Archive that anyone can edit.
---
Example search string to catch many occurrences of Robert Louis Stevenson in the database. Note: this search string is almost at the maximum length allowed. It would be easy to construct a longer one that would kick back an error, but this is the type of search needed given the complex nature of this database.
mediatype:(texts) (subject:"Stevenson, Robert Louis, 1850-1894" OR subject:"Stevenson, R. L. (Robert Louis), 1850-1894" OR subject:"Stevenson, Robert L. (Robert Louis), 1850-1894" OR subject:"Stevenson, Robert Louis" OR subject:"Stevenson, R. L. (Robert Louis)" OR subject:"Stevenson, Robert L. (Robert Louis)" OR subject:"Robert Louis Stevenson" OR subject:"Robert L. Stevenson" OR subject:"R. L. Stevenson" OR creator:"Stevenson, Robert Louis, 1850-1894" OR creator:"Stevenson, Robert Louis, Sir, 1850-1894" OR creator:"Stevenson, R. L. (Robert Louis), 1850-1894" OR creator:"Stevenson, Robert L. (Robert Louis), 1850-1894" OR creator:"Stevenson, Robert Louis" OR creator:"Stevenson, R. L. (Robert Louis)" OR creator:"Stevenson, Robert L. (Robert Louis)" OR creator:"Robert Louis Stevenson" OR creator:"Robert L. Stevenson" OR creator:"R. L. Stevenson" OR title:"Robert Louis Stevenson" OR title:"Robert L. Stevenson" OR title:"R. L. Stevenson" OR description:"Robert Louis Stevenson" OR description:"Robert L. Stevenson" OR description:"R. L. Stevenson" OR description:"Stevenson, Robert Louis" OR description:"Stevenson, R. L. (Robert Louis)" OR description:"Stevenson, Robert L. (Robert Louis)")
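For anyone who would rather generate a string like this than type it by hand, here is a rough Python sketch. It is only an illustration: the function names are made up, the 1024-character limit is my guess from above rather than a documented value, and splitting into several smaller queries is just one possible workaround (the results would still have to be merged and de-duplicated by hand or by another script).

```python
# Assumed limit, taken from the "1024 characters (or whatever the max is)" guess.
MAX_QUERY_LEN = 1024


def make_clauses(variants_by_field):
    """Turn {field: [name variants]} into a flat list of field:"name" clauses."""
    return [
        f'{field}:"{name}"'
        for field, names in variants_by_field.items()
        for name in names
    ]


def build_query(variants_by_field, mediatype="texts"):
    """OR every clause together into one search string (may exceed the limit)."""
    clauses = make_clauses(variants_by_field)
    return f"mediatype:({mediatype}) (" + " OR ".join(clauses) + ")"


def split_queries(variants_by_field, max_len=MAX_QUERY_LEN, mediatype="texts"):
    """Greedily pack clauses into several queries that each fit within max_len.

    A single clause that is longer than max_len on its own is still emitted
    as its own (over-long) query rather than being dropped.
    """
    prefix = f"mediatype:({mediatype}) ("
    queries, current = [], []
    for clause in make_clauses(variants_by_field):
        candidate = prefix + " OR ".join(current + [clause]) + ")"
        if current and len(candidate) > max_len:
            # Adding this clause would overflow: close the current query.
            queries.append(prefix + " OR ".join(current) + ")")
            current = [clause]
        else:
            current.append(clause)
    if current:
        queries.append(prefix + " OR ".join(current) + ")")
    return queries


if __name__ == "__main__":
    variants = {
        "creator": ["Stevenson, Robert Louis, 1850-1894", "R. L. Stevenson"],
        "title": ["Robert Louis Stevenson"],
    }
    query = build_query(variants)
    print(len(query), query)
```

If the single string from build_query comes out longer than the limit, split_queries gives a list of shorter strings to run one at a time.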
Poster: stbalbach
Date: September 20, 2011 07:01:59pm
Forum: texts
Subject: Re: Inconsistent author names
True, 2.2% missing doesn't seem bad, but I bet if you scrolled through the 1547 set, you'd find books that don't belong, while books that should be there are not included, making that missing percentage larger. For example, books under "R.L. Stevenson" would not show up in the 1547 set, and there are other cases like that. It's the problem of inconsistent author names, which require customized searches that can be longer than the search string limit allows.