bibtex2rdf - A configurable BibTeX to RDF Converter

Introduction

Many researchers use LaTeX to prepare their publications. They typically collect bibliographic references in the BibTeX format, because LaTeX can then generate reference lists automatically. In BibTeX, each reference is captured as one entry consisting of (tagged) fields. For an introduction to BibTeX, see the References section. With the advent of the Semantic Web, several projects have started to translate this bibliographic information to RDF. bibtex2rdf is a highly configurable translator from BibTeX to RDF that does exactly that. This document explains how it works and how to use it.

Usage

The most direct way to understand how a program works is to try it. Therefore, we start with the usage section.

bibtex2rdf is provided as a jar file (the source code will be published soon). The current version is 1.0 beta 5. Download it here. It is a command-line tool and requires a Java JDK/JRE 1.5. To translate sample.bib to sample.rdf, type
    java -jar bibtex2rdf.jar sample.bib sample.rdf

The complete call syntax is java -jar bibtex2rdf.jar [-schema <file>] [-baseuri <uri>] [-enc <enc>] <bibtex> [<output>]. For an explanation of the parameters, see the following table. The application generates a log file (bibtex2rdf.log) that contains all warnings and errors.

Parameter | Description
-schema <file> | Optional schema file. See the section Mapping Configuration.
-baseuri <uri> | Prepend the specified base URI to all generated URIs. If omitted, file-local URIs are used.
-enc <enc> | Use the specified encoding. The default is ISO-8859-1. To generate Unicode output, use UTF-8 or UTF-16.
<bibtex> | The file to translate to RDF. If it is a directory, bibtex2rdf scans it (and its sub-directories) and translates all files found that have a .bib suffix.
<output> | The file the result is written to. If omitted, the result is written to stdout.
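
For batch use, the call syntax above can be assembled from a script. The following is a minimal Python sketch, not part of bibtex2rdf: the function names build_command and convert are hypothetical, and the jar is assumed to sit in the current directory. It only arranges the documented parameters in the documented order.

```python
import subprocess


def build_command(bibtex, output=None, schema=None, base_uri=None,
                  encoding=None, jar="bibtex2rdf.jar"):
    """Assemble the argument list following the documented call syntax:
    java -jar bibtex2rdf.jar [-schema <file>] [-baseuri <uri>] [-enc <enc>] <bibtex> [<output>]
    """
    cmd = ["java", "-jar", jar]
    if schema is not None:
        cmd += ["-schema", schema]
    if base_uri is not None:
        cmd += ["-baseuri", base_uri]
    if encoding is not None:
        cmd += ["-enc", encoding]
    cmd.append(bibtex)
    if output is not None:
        cmd.append(output)
    return cmd


def convert(bibtex, output=None, **options):
    """Run the converter (requires Java and the jar to be available).
    subprocess raises CalledProcessError if the process exits with a
    non-zero status."""
    return subprocess.run(build_command(bibtex, output, **options), check=True)
```

A call such as convert("sample.bib", "sample.rdf", encoding="UTF-8") would then correspond to java -jar bibtex2rdf.jar -enc UTF-8 sample.bib sample.rdf.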

Feedback

If you find a bug or want to suggest an additional feature, please send me an e-mail. My e-mail address can be found on my homepage.

The mapping

The mapping is highly configurable. This section describes the default mapping. For a description of all mapping configuration options, see the section Mapping Configuration.

For each BibTeX entry, at least one resource is generated. If an entry has authors and/or editors, we generate a separate resource to describe each person. Thus, the same resource can be used if one person has authored or edited several publications. The same applies to publications that are part of a collection (conference, journal, book, etc.). In this case the collection is modeled as a separate resource, and all fields that relate to the collection rather than to the publication also become properties of the collection (e.g. publisher, editor, address, year, month). We do not (yet) create a resource to identify a journal or conference series; in other words, each journal number is modeled as a separate collection. Also, in contrast to the person data, we do not (yet) attempt to identify identical collections and merge them into one resource.

Example

The second most direct way to understand a translation is an example. Therefore, we present one here before going into the details. The BibTeX entry

@InProceedings{aberer2003chatty,
  author =   {Karl Aberer and Philippe Cudré-Mauroux and Manfred Hauswirth},
  title =    {The Chatty Web: Emergent Semantics Through Gossiping},
  booktitle = {Proceedings of the Twelfth International World Wide Web Conference},
  location = {Budapest, Hungary},
  year =     {2003},
  month =    {May},
  pages = {197--206},
  publisher = {ACM Press},
  address =  {New York, USA},
  url = {http://www2003.org/cdrom/papers/refereed/p471/471-aberer.html}
}
is converted to
<?xml version='1.0' encoding='ISO-8859-1'?>
<rdf:RDF
    xmlns:bibtex="http://www.edutella.org/bibtex#"
    xmlns:dct="http://purl.org/dc/terms/"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">

  <bibtex:InProceedings rdf:about="aberer2003chatty"
     dc:date="2003-05"
     bibtex:pages="197-206">
    <dc:title>The Chatty Web: Emergent Semantics Through Gossiping</dc:title>
    <dct:isPartOf>
      <bibtex:Proceedings
         dc:date="2003-05">
        <dc:title>Proceedings of the Twelfth International World Wide Web Conference</dc:title>
        <vcard:ADR
           vcard:Locality="Budapest"
           vcard:Country="Hungary"/>
        <dc:publisher rdf:resource="aberer2003chatty:ACM_Press"/>
      </bibtex:Proceedings>
    </dct:isPartOf>
    <dc:identifier>http://www2003.org/cdrom/papers/refereed/p471/471-aberer.html</dc:identifier>
    <dc:publisher rdf:resource="aberer2003chatty:ACM_Press"/>
    <dc:creator>
      <rdf:Seq>
        <rdf:li rdf:resource="aberer2003chatty:Aberer_Karl"/>
        <rdf:li rdf:resource="aberer2003chatty:Cudré-Mauroux_Philippe"/>
        <rdf:li rdf:resource="aberer2003chatty:Hauswirth_Manfred"/>
      </rdf:Seq>
    </dc:creator>
  </bibtex:InProceedings>

  <bibtex:Person rdf:about="aberer2003chatty:Aberer_Karl">
    <vcard:FN>Karl Aberer</vcard:FN>
    <vcard:N
       vcard:Family="Aberer"
       vcard:Given="Karl"/>
  </bibtex:Person>
  <bibtex:Person rdf:about="aberer2003chatty:Hauswirth_Manfred">
    <vcard:N
       vcard:Given="Manfred"
       vcard:Family="Hauswirth"/>
    <vcard:FN>Manfred Hauswirth</vcard:FN>
  </bibtex:Person>
  <bibtex:Person rdf:about="aberer2003chatty:Cudré-Mauroux_Philippe">
    <vcard:FN>Philippe Cudré-Mauroux</vcard:FN>
    <vcard:N
       vcard:Given="Philippe"
       vcard:Family="Cudré-Mauroux"/>
  </bibtex:Person>

  <bibtex:Organization rdf:about="aberer2003chatty:ACM_Press">
    <vcard:ADR rdf:parseType="Resource">
      <vcard:Country>USA</vcard:Country>
      <vcard:Locality>New York</vcard:Locality>
    </vcard:ADR>
    <vcard:FN>ACM Press</vcard:FN>
  </bibtex:Organization>
</rdf:RDF>
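
To illustrate how the structure above can be consumed, here is a minimal Python sketch that reads the author sequence of each entry and resolves every reference to the vcard:FN of the person resource. It is an illustration only: it uses the standard library XML parser on a shortened copy of the output, the function name authors_in_order is hypothetical, and it relies on the exact XML layout shown above. A real consumer would more likely use an RDF library, since the same RDF graph can be serialized in other ways.

```python
import xml.etree.ElementTree as ET

RDF = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"
DC = "http://purl.org/dc/elements/1.1/"
VCARD = "http://www.w3.org/2001/vcard-rdf/3.0#"
BIBTEX = "http://www.edutella.org/bibtex#"


def q(namespace, name):
    """Build the '{namespace}name' form that ElementTree uses for qualified names."""
    return "{%s}%s" % (namespace, name)


def authors_in_order(rdf_xml):
    """Map each entry URI to the full names of its authors, in rdf:Seq order."""
    root = ET.fromstring(rdf_xml)
    # Collect the full name of every person resource, keyed by its URI.
    names = {}
    for person in root.iter(q(BIBTEX, "Person")):
        full_name = person.find(q(VCARD, "FN"))
        if full_name is not None:
            names[person.get(q(RDF, "about"))] = full_name.text
    # Walk the top-level resources and resolve their dc:creator sequence.
    result = {}
    for entry in root:
        seq = entry.find(q(DC, "creator") + "/" + q(RDF, "Seq"))
        if seq is None:
            continue
        refs = [li.get(q(RDF, "resource")) for li in seq.findall(q(RDF, "li"))]
        result[entry.get(q(RDF, "about"))] = [names.get(ref, ref) for ref in refs]
    return result


# Shortened copy of the sample output (two of the three authors).
SAMPLE = """\
<rdf:RDF
    xmlns:bibtex="http://www.edutella.org/bibtex#"
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:vcard="http://www.w3.org/2001/vcard-rdf/3.0#"
    xmlns:dc="http://purl.org/dc/elements/1.1/">
  <bibtex:InProceedings rdf:about="aberer2003chatty">
    <dc:creator>
      <rdf:Seq>
        <rdf:li rdf:resource="aberer2003chatty:Aberer_Karl"/>
        <rdf:li rdf:resource="aberer2003chatty:Hauswirth_Manfred"/>
      </rdf:Seq>
    </dc:creator>
  </bibtex:InProceedings>
  <bibtex:Person rdf:about="aberer2003chatty:Aberer_Karl">
    <vcard:FN>Karl Aberer</vcard:FN>
  </bibtex:Person>
  <bibtex:Person rdf:about="aberer2003chatty:Hauswirth_Manfred">
    <vcard:FN>Manfred Hauswirth</vcard:FN>
  </bibtex:Person>
</rdf:RDF>
"""

if __name__ == "__main__":
    print(authors_in_order(SAMPLE))
```

Because the authors are referenced from an rdf:Seq, the order of the names in the result matches the order of the BibTeX author field.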

Used RDF Schemas

For our mapping, we have tried to use as many existing RDF schema elements as possible instead of inventing a new schema. However, this was not possible in all cases. For example, there is no common classification similar to the BibTeX entry types. We use the following namespaces and abbreviations:
Abbreviation | Namespace URI | Comment
dc | http://purl.org/dc/elements/1.1/ | Dublin Core standardized schema for basic bibliographic metadata
dct | http://purl.org/dc/terms/ | Dublin Core Metadata Terms to refine the basic elements
vcard | http://www.w3.org/2001/vcard-rdf/3.0# | RDF mapping of vCard person and address data
bibtex | http://www.edutella.org/bibtex# | New schema to capture all remaining elements

Standard entry types and their mapping

(Descriptions taken from The BibTex Format)
BibTeX Type | RDF Type | Comment
@article | bibtex:Article | An article from a journal or magazine.
@book | bibtex:Book | A book with an explicit publisher.
@booklet | bibtex:Booklet | A work that is printed and bound, but without a named publisher or sponsoring institution.
@conference | bibtex:InProceedings | The same as inproceedings.
@inbook | bibtex:InBook | A part of a book, which may be a chapter (or section or whatever) and/or a range of pages.
@incollection | bibtex:InCollection | A part of a book having its own title.
@inproceedings | bibtex:InProceedings | An article in a conference proceedings.
@manual | bibtex:Manual | Technical documentation.
@mastersthesis | bibtex:MastersThesis | A Master's thesis.
@misc | bibtex:Misc | Use this type when nothing else fits.
@phdthesis | bibtex:PhDThesis | A PhD thesis.
@proceedings | bibtex:Proceedings | The proceedings of a conference.
@techreport | bibtex:TechReport | A report published by a school or other institution, usually numbered within a series.
@unpublished | bibtex:Unpublished | A document having an author and title, but not formally published.
Non-standard entry types are not supported by default, but they can be added to the type mappings (see the section Mapping Configuration).

Standard fields

BibTeX field | RDF property | BibTeX Comment | RDF Mapping Comment
address | vcard:Locality, vcard:Country | For journals, books, etc. usually the address of the publisher or other type of institution. For proceedings, often the location of the event. | If the field contains a comma, the text after the last comma is taken as vcard:Country and everything before it as vcard:Locality.
annote | bibtex:annote | An annotation. It is not used by the standard bibliography styles, but may be used by others that produce an annotated bibliography. | The content of this field is not cleared of special TeX formatting, but left unchanged.
author | dc:creator | The name(s) of the author(s), in the format described in the LaTeX book. | Authors are listed as elements of an rdf:Seq. For each author, a resource with the properties vcard:FN and vcard:N is created. The vcard:N resource gets the properties vcard:Family, vcard:Given, vcard:Other, vcard:Prefix and vcard:Suffix, if these parts appear in the name.
booktitle | dct:isPartOf | Title of a book (or other collection), part of which is being cited. | Book titles are handled as collections. If the entry is of type inproceedings, the representing resource is typed bibtex:Proceedings.
chapter | bibtex:chapter | A chapter (or section or whatever) number. | Default mapping. The LaTeX commands commonly used in BibTeX files (e.g. accents) are handled.
crossref | dct:isPartOf | The database key of the entry being cross referenced. Any fields that are missing from the current record are inherited from the field being cross referenced. | A reference to the resource created for the cross-referenced entry is created.
edition | bibtex:edition | The edition of a book, for example "Second". | Default mapping.
editor | dc:creator | Name(s) of editor(s), typed as indicated in the LaTeX book. If there is also an author field, then the editor field gives the editor of the book or collection in which the reference appears. | The editor is added to the collection resource, as author.
howpublished | bibtex:howpublished | How something strange has been published. | Default mapping.
institution | bibtex:institution | The sponsoring institution of a technical report. | This field is handled as an organization.
journal | dct:isPartOf | A journal name. | Journals are handled as collections. Resources representing journals are typed bibtex:Journal.
key | dc:identifier | Used for alphabetizing, cross referencing, and creating a label when the "author" information is missing. | Default mapping.
location | vcard:Locality, vcard:Country | A location associated with the entry, such as the city in which a conference took place. | Handled like address.
month | dc:date | The month in which the work was published or, for an unpublished work, in which it was written. | The month field is merged with the year field to form the dc:date property.
note | bibtex:note | Any additional information that can help the reader. | The content of this field is not cleared of special TeX formatting, but left unchanged.
number | bibtex:number | The number of a journal, magazine, technical report, or of a work in a series. | Default mapping. This information is added to the collection resource.
organization | bibtex:organization | The organization that sponsors a conference or that publishes a manual. | This field is handled as an organization.
pages | bibtex:pages | One or more page numbers or range of numbers, such as 42--111 or 7,41,73--97 or 43+. | Consecutive hyphen characters are transformed to exactly one hyphen.
publisher | dc:publisher | The publisher's name. | This field is handled as an organization.
school | bibtex:school | The name of the school where a thesis was written. | This field is handled as an organization.
series | dct:isPartOf | The name of a series or set of books. | This field is handled as a collection.
title | dc:title | The work's title, typed as explained in the LaTeX book. | Default mapping.
type | bibtex:type | The type of a technical report, for example "Research Note". | Default mapping.
url | dc:identifier | The WWW Universal Resource Locator that points to the item being referenced. | Default mapping.
volume | bibtex:volume | The volume of a journal or multi-volume book. | Default mapping. This information is added to the collection resource.
year | dc:date | The year of publication or, for an unpublished work, the year it was written. | In W3CDTF format. If the month field is available, month and year information are merged.
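
Three of the rules in the table above are simple string transformations: splitting address at the last comma, merging year and month into one date, and collapsing runs of hyphens in pages. The following Python sketch illustrates them. It is not the converter's code: the function names are hypothetical, and the handling of an address without a comma or of an unrecognized month is an assumption, since the table does not specify those cases.

```python
import re

# Assumption: months are given as English names or three-letter abbreviations.
MONTHS = {name: number for number, name in enumerate(
    ["jan", "feb", "mar", "apr", "may", "jun",
     "jul", "aug", "sep", "oct", "nov", "dec"], start=1)}


def split_address(value):
    """Text after the last comma becomes Country, everything before it Locality."""
    if "," not in value:
        # Assumption: without a comma, the whole value is kept as Locality.
        return {"Locality": value.strip()}
    locality, country = value.rsplit(",", 1)
    return {"Locality": locality.strip(), "Country": country.strip()}


def merge_date(year, month=None):
    """Merge year and month into the form YYYY-MM, or return the year alone."""
    year = year.strip()
    if month is None:
        return year
    number = MONTHS.get(month.strip().lower()[:3])
    if number is None:
        # Assumption: an unrecognized month is dropped.
        return year
    return "%s-%02d" % (year, number)


def normalize_pages(pages):
    """Replace every run of consecutive hyphens with exactly one hyphen."""
    return re.sub(r"-{2,}", "-", pages)
```

Applied to the fields of the example entry, these functions reproduce the values in the sample output: "Budapest, Hungary" is split into Locality "Budapest" and Country "Hungary", year 2003 and month May are merged into "2003-05", and "197--206" becomes "197-206".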

Mapping Configuration

The mapping can be configured via a Java properties file. The commented sample file below represents the default configuration. Exactly this default configuration is used when no '-schema' argument is specified. If you write your own configuration file, you only need to include the properties whose values differ from the default. See below for links to other configuration files.

Important Note: the mapping configuration file format is subject to change in upcoming beta versions.

##########################################################
# Namespaces
##########################################################

# declare namespaces in the form
# ns_<shorthand>=<uri>
ns_rdfs=http://www.w3.org/2000/01/rdf-schema#
ns_dc=http://purl.org/dc/elements/1.1/
ns_dct=http://purl.org/dc/terms/
ns_vcard=http://www.w3.org/2001/vcard-rdf/3.0#

# if a namespace with shorthand 'unknown' is declared,
# this namespace is used to create RDF property names
# for unknown BibTeX fields.
# By default, unknown fields are ignored.
#ns_unknown=http://www.edutella.org/bibtex_unknown/

# if a namespace with shorthand 'bibtex' is declared,
# this namespace is used to create RDF property names
# for known BibTeX types and fields, if they
# arent mapped to DC or RDF properties
ns_bibtex=http://www.edutella.org/bibtex#

##########################################################
# Flags
##########################################################

# flags control the way the output is structured

# create a property where year and month are merged as
# one property (in the form YYYY-MM)
createDate=true

# try to create an address resource and split the
# address field into components (Locality and Country)
createAddressResource=true

# create a Seq for author and editor lists.
# if this flag is set to false, each author/editor
# is added directly as property to the entry resource
createSeqForPersonList=true

# create a separate resource for each author/editor.
# If this flag is false, the fullname is used as
# property value.
createPersonResource=true

# create separate resources for collections
# (proceedings, journals, etc.).
# If this flag is false, the collection title
# (and all other collection related information)
# is added directly to the entry resource.
createCollectionResource=true

# add a Seq of all generated entries to the output.
# creates a Seq which contains all entry references.
# this allows to preserve the entry order information.
# the URI of this sequence will be <baseUri>+"referenceList"
createEntryList=true

# add datatype declarations to all literals
createDatatypes=false

# to overwrite default datatypes, use the following entries
# all other fields are of type xsd:string and currently not overwritable
yearType=http://www.w3.org/2001/XMLSchema#nonNegativeInteger
numberType=http://www.w3.org/2001/XMLSchema#nonNegativeInteger
volumeType=http://www.w3.org/2001/XMLSchema#nonNegativeInteger
chapterType=http://www.w3.org/2001/XMLSchema#nonNegativeInteger
dateType=http://www.w3.org/2001/XMLSchema#gYearMonth

##########################################################
# Field output lists
##########################################################

# field output lists have three purposes:
# - they allow to restrict the output to a selected subset
#   of fields
# - they allow to specify which fields are collection
#   information and which are entry information
# - they allow to specify if some special output is
#   requested which doesn't directly correspond to
#   a BibTeX field
# Such additional properties are:
# - sourceFile: outputs source file information
# - label:      adds a label to a resource
# - shortTitle: tries to extract a short title from
#               the title and adds it as separate
#               property
#
# as shorthand, the pseudo-field "all" is allowed to
# specify that all fields for which a mapping is available
# should be mapped. Note that the latter three special
# properties are not included in "all". To get these, you
# have to specify them additionally, as in "all, label, sourceFile".

# for BibTeX entries, output all fields, but nothing special.
entryProperties=all

# for person and organization resources, output all
# available fields. This is a full name and a
# structured name resource according to vCard.
personProperties=all

# assign the following fields to collection resources.
# this is the default for all collection types.
collectionProperties=address, booktitle, crossref, editor, journal, location,\
    month, number, publisher, series, volume, year, shortTitle

# if you want to assign different fields to specific collection
# types, you can overwrite the default by setting the following properties.
proceedingsProperties=address, booktitle, location, publisher, month, volume, year
journalProperties=address, journal, month, number, publisher, volume, year
seriesProperties=publisher,series
bookProperties=booktitle, editor, series, year

# if you only want the fullname, use
# personProperties=personFullname

# if you want to output some fields just as strings, add them to the following list
verbatimProperties=note, annote, key

##########################################################
# Type mappings
##########################################################

# Types start with an upper case letter

# Default entry types and their associated RDF types
Article=bibtex:Article
Book=bibtex:Book
Booklet=bibtex:Publication
InBook=bibtex:InBook
InCollection=bibtex:InCollection
InProceedings=bibtex:InProceedings
Manual=bibtex:Manual
MastersThesis=bibtex:Masterthesis
Misc=bibtex:Misc
Periodical=bibtex:Publication
PhdThesis=bibtex:PhDThesis
Proceedings=bibtex:Proceedings
TechReport=bibtex:TechnicalReport
Unpublished=bibtex:Unpublished
Conference=bibtex:Conference

# You may add new non-standard entry types
# which will be translated according to the
# specified mapping
# Matharticle=bibtex:Article
# Mastersthesis=bibtex:Masterthesis
# Masterthesis=bibtex:Masterthesis
# Mscthesis=bibtex:Masterthesis
# Periodical=bibtex:Publication

# Types assigned to person and organization
# resources from the corresponding BibTeX field.
Author=bibtex:Person
Editor=bibtex:Person
Organization=bibtex:Organization
Institution=bibtex:Organization
School=bibtex:Organization
Publisher=bibtex:Organization

# Types assigned to collection resources
# They are inferred from the entry type
# and from the BibTeX field.
# Proceedings and Books are already defined above

# collection for @article
Journal=bibtex:Journal

# collection for series field
Series=bibtex:Series

# everything else
Collection=bibtex:Collection

# special cases
# resource for the 'and other' author/editor part
EtAl=bibtex:EtAl

# type of resource which represents the source file
BibFile=bibtex:SourceFile

##########################################################
# Field mappings
##########################################################

# fields start with a lower case letter

# address related fields
address=vcard:ADR
location=vcard:ADR

# date related fields
year=bibtex:year
month=bibtex:month

# title related fields
title=dc:title

# collection related fields
#
# Note that in the collection resource these
# fields are always mapped to the title property.
#
# if you set createCollectionResource to false,
# you also need to change the mapping for these fields.
booktitle=dct:isPartOf
journal=dct:isPartOf
series=dct:isPartOf
crossref=dct:isPartOf

# person or organization related fields
author=dc:creator
editor=bibtex:editor
publisher=dc:publisher
institution=bibtex:institution
organization=bibtex:organization
school=bibtex:school

# identifier fields
url=dc:identifier
key=dc:identifier

# all other bibtex fields
annote=bibtex:annote
chapter=bibtex:chapter
edition=bibtex:edition
howpublished=bibtex:howpublished
note=bibtex:note
number=bibtex:number
pages=bibtex:pages
type=bibtex:type
volume=bibtex:volume

# fields derived from BibTeX information

# used if createAddressResource
addressCountry=vcard:Country
addressLocality=vcard:Locality

# used for the merged date
date=dc:date

# used for person and organization resources
personFullname=vcard:FN
personStructuredName=vcard:N

# the structured name has several parts.
# "Charles Louis Xavier Joseph de la Vallee Poussin Jr" is
# split as follows:
# nameFamily = "Vallee Poussin"
# namePrefix = "de la"
# nameSuffix = "Jr"
# nameGiven = "Charles"
# nameOther = "Louis Xavier Joseph"
nameFamily=vcard:Family
namePrefix=vcard:Prefix
nameSuffix=vcard:Suffix
nameGiven=vcard:Given
nameOther=vcard:Other

# property used to attach a label
label=rdfs:label

# While Persons and Organizations always get their full name as label,
# you can specify a label pattern for entries. Use <field> to refer
# to a BibTex field and 'text' to include fixed text.
# you can concatenate any elements using +.
# to add label components only if a specific field x exists, use
# (<x>: ...), e.g. (<year>: ', '+<year>)
# This is the default setting:
defaultLabelPattern=<title>

# if you want to use different pattern for different types,
# you can overwrite the default, e.g.:
# articleLabelPattern=(<author>:<author>+'. ')+<title>+(<journal>:'. '+<journal>+(<volume>:' '+<volume>+(<number>:'('+<number>+')')))+(<year>:', '+<year>)

# property used to attach source file information
sourceFile=bibtex:sourceFile

# property used to add the absolute path as string
# to the source file resource
fileAbsolutePath=bibtex:absolutePath
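
As stated at the start of this section, a custom configuration file only needs the properties that differ from the defaults. The following Python sketch models that layering. It is an illustration, not how bibtex2rdf is implemented: the function names are hypothetical, DEFAULTS lists only three of the default values from the sample file, and the parser covers only a small subset of the Java properties format (comment lines, key=value pairs and backslash line continuations).

```python
def parse_properties(text):
    """Parse key=value lines, '#' comment lines and backslash continuations.

    Only a subset of the Java properties format is handled: no ':' separator
    and no escape sequences.
    """
    props = {}
    pending = ""  # text collected from lines that ended with a backslash
    for raw in text.splitlines():
        line = raw.strip()
        if not pending and (not line or line.startswith("#")):
            continue  # blank line or comment
        if line.endswith("\\"):
            pending += line[:-1]  # drop the backslash, keep collecting
            continue
        line = pending + line
        pending = ""
        key, separator, value = line.partition("=")
        if separator:
            props[key.strip()] = value.strip()
    return props


# Three of the default values from the sample file, for illustration only.
DEFAULTS = {
    "createDate": "true",
    "createPersonResource": "true",
    "personProperties": "all",
}


def effective_config(defaults, override_text):
    """Start from the defaults and replace only the keys the override file sets."""
    merged = dict(defaults)
    merged.update(parse_properties(override_text))
    return merged


# A custom file that changes two properties and leaves the rest untouched.
OVERRIDES = """\
# only the differences from the default configuration
createPersonResource=false
personProperties=personFullname
"""

if __name__ == "__main__":
    print(effective_config(DEFAULTS, OVERRIDES))
```

With these overrides, createPersonResource and personProperties take the new values while createDate keeps its default.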

Other mapping tools and definitions

There are already several BibTeX to RDF mappings available. For each one we know of, we provide a configuration file. Note that no deep analysis of these mappings has taken place. Therefore, a configuration file might produce slightly different results than the original converter.

BibTeX-2-RDF translator

This is an online service provided by the SemanticWeb@VU initiative at Vrije Universiteit Amsterdam. It is available at http://www.cs.vu.nl/~mcaklein/bib2rdf/. This translator creates person and organization resources, but no collection resources. Also, all fields are translated into properties of the same schema. Use the file VUMapping.properties (coming soon) to create similar output.

Java BibTeX-To-RDF Converter

A Java application for conversion is provided at AIFB, Universität Karlsruhe as part of the SWAP project. It is downloadable at http://www.aifb.uni-karlsruhe.de/WBS/pha/bib/index.html. This translator creates person and organization resources, and adds source file information to them. The file SWRCMapping.properties creates similar output, with the following exceptions: a) In the original output, the RDF type and property URIs don't have a namespace. According to Jena 2.0 this isn't valid anymore. b) while person resources are created, the author/editor property isn't added to the entry resources; as this seems to be just an omission, we add these properties.

Visus BibTex Ontology in OWL

An OWL Ontology for BibTex is available at http://visus.mit.edu/bibtex/0.1/. This seems to be just a flat schema. An RDF model according to this schema can be generated with VisusMapping.properties.

MarcOnt Initiative BibTex-to-RDF Converter

As part of its mediator architecture, MarcOnt also provides a translator from BibTeX to RDF (see http://www.marcont.org/mediations.jsp). The OWL schema used is available at http://www.marcont.org/marcont/marcont.owl and refers to the Visus ontology. The output created by the online converter is very flat: even all author names are put into one string. An RDF model according to this schema can be generated with MarcOntMapping.properties.

Acknowledgments

bibtex2rdf uses several open source libraries to fulfill its task:

Library | Description
JavaBib | a highly capable BibTeX parser, provided by Johannes Henkel
Jena | the well-known RDF library provided by HP Research Labs
Log4J | the also well-known logging toolkit provided by Apache Software Foundation
MacBinary Toolkit 2 for Java | provides an AccentComposer to create accented Unicode characters; provided by Gregory Guerin

References