Developer guide

Guidelines and notes for Python coding style

  1. Note that one can mix 8-bit Python strings (ASCII text) with UTF-8 encoded text as long as the 8-bit string contains only ASCII characters.

  2. When running into Unicode errors, keep in mind that reading a line of text from a file produces a line of bytes, not characters. To decode the bytes into a string of characters, you need to know the encoding (see the example following this list).

  3. There are a couple of minor points where the Bibulous coding standards deviate from Python’s PEP 8:

    1. A line width of 120 is the standard (not 80).
    2. In general, statements that evaluate to a boolean are placed within parentheses (i.e. if (a < b): rather than if a < b:).
  4. Many developers prefer to spread out code among a large number of small files, but Bibulous is currently organized as a single large file. This is partly because no block of code is large enough on its own to justify a separate file. (Parsing of .bib files, for example, only requires a couple hundred lines.)
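
A minimal illustration of point 2 above (the byte string is an invented example):

    # Bytes read from a file opened in binary mode must be decoded explicitly,
    # and the decoding only works if the file's encoding is known.
    raw = b'Bj\xc3\xb6rk and others'
    text = raw.decode('utf-8')    # -> 'Björk and others'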

Overall project strategy and code structure

The basic function of BibTeX is to accept an .aux file as input and to produce a .bbl file as output. The aux file contains all of the citation information as well as the filenames for the bibliography database file (.bib) and the style file (.bst).

The basic program flow is as follows:

  1. Read the .aux file and get the names of the bibliography databases (.bib files) and style templates (.bst files) to use, together with the entire set of citations.
  2. Read in the Bibulous style template file as a dictionary (bstdict).
  3. If the use_citeextract keyword is set to True, and if an “extracted” database file exists, then compare the citations in the extracted database against those in the .aux file. If there are any differences, then re-extract the database. Otherwise, use the extracted database rather than the full one specified in the .aux file.
  4. Read in all of the bibliography database files into one long dictionary (bibdata), replacing any abbreviations with their full form. In an “extracted” database, all entries are parsed, whereas in any other type of database file, only those entries whose keywords are found in the citation list are actually parsed. All other entries have their data saved as unparsed strings. Cross-referenced data is not yet inserted at this point. That is delayed until the time of writing the BBL file in order to speed up parsing. It is only then that the cross-referenced entries have their data parsed into dictionary form.
  5. Now that all the information is collected, we can generate the .bbl file. Create the list of sortkeys, then go through each corresponding citation key in turn, and find the corresponding entry key in bibdata. If there is crossref data, then fill in missing values here. Also create the “special variables” here. Finally, from the entry type, select a template from bstdict and begin inserting the variables one-by-one into the template.
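
The sketch below shows the rough shape of the three central data structures mentioned above; the keys, field names, and template text are invented examples rather than output from a real run:

    citedict = {'smith2010': 1, 'jones2012': 2}         # citation key -> citation order
    bibdata = {'smith2010': {'entrytype': 'article',    # entry key -> parsed fields
                             'author': 'J. Smith',
                             'year': '2010'}}
    bstdict = {'article': '<au>, \\textit{<journal>} (<year>).'}   # entrytype -> template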

Because the .bib file is highly structured, it is straightforward to write a parser by hand in Python: the parse_bibfile() method converts the .bib file contents into a Python dictionary (the Bibdata class’ bibdata). The .aux file is even easier to parse, and the parse_auxfile() method converts the citation information into the Bibdata class’ citedict dictionary.

The Bibdata class thus holds all relevant information needed to operate on a bibliography and generate the output LaTeX-formatted .bbl file.

Parsing BIB files

parse_bibfile()

The strategy for parse_bibfile() is to find each individual bibliography entry, determine its entry type, and save all of the text between the entry’s opening and closing braces as one long string, to be passed to parse_bibentry() for further parsing. To gather the entry data string, we first look for a line that starts with @. On that line, we look for a string after the @ followed by {, where the string gives the entry type. After we know the entry type, we look for the corresponding closing brace. If we don’t find it on the same line, then we read in the next line, and so forth, concatenating all of the lines into one long “entry string” until we encounter the corresponding closing brace. Once we have this extended “entry string” we feed it to parse_bibentry() to generate the bibliography data. Once we have come to the end of a given entry, we continue reading down the file looking for the next ‘@’ and so on.

Although this approach effectively means that we pass twice through the same data, it keeps the brace-matching manageable: the BibTeX format allows nested delimiters, which are not directly compatible with regular expressions, so trying to do everything in a single pass quickly becomes a mess.
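
As a minimal sketch of the brace-level tracking (the real code must also cope with braces inside quoted strings and with malformed input), the core idea looks like this:

    def find_closing_brace(entrystr, start):
        '''Return the index of the brace closing the "{" at position start, or -1
        if the string never resolves back to brace level zero.'''
        level = 0
        for i in range(start, len(entrystr)):
            if entrystr[i] == '{':
                level += 1
            elif entrystr[i] == '}':
                level -= 1
                if level == 0:
                    return i
        return -1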

parse_bibentry()

Because parse_bibfile() has already split the data by individual entry, parse_bibentry() only needs to worry about parsing a single entry, and there are five possible formats for the entry string passed to the function:

  1. If the entrytype is a comment, then skip everything, adding nothing to the database dictionary.
  2. If the entrytype is a preamble, then treat the entire entry contents as a single fieldvalue. Append the string onto the preamble value in the bibdata dictionary.
  3. If the entrytype is an acronym, then get the entrykey and copy it into the name field. The remainder of the string is a single field value (the full form of the acronym); copy that into the description field.
  4. If the entrytype is a string (i.e. an abbreviation), then there is no entrykey. Get the fieldname (abbreviation key), and the remainder of the string is a single field value (the full form of the abbreviated string). Add this key-value pair to the abbrevs dictionary.
  5. If the entry is any other type, then get the entrykey, and the remainder of the string is a series of field-value pairs.

Once it determines which of these five cases applies, parse_bibentry() extracts the entry key (if present) and locates each individual field, separating out the string corresponding to each field’s key-value pair. It does not actually parse the individual fields; for that, it loops over the fields, calling parse_bibfield() on each one to extract the key-value pair.
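
The following is a much-simplified sketch of the dispatch over the five cases, shown only to make the control flow concrete; handle_entry() and its arguments are invented names, and the field handling is far cruder than the real parse_bibentry()/parse_bibfield() pair:

    def handle_entry(entrytype, entrystr, bibdata, abbrevs):
        if entrytype == 'comment':
            return                                   # case 1: add nothing
        elif entrytype == 'preamble':                # case 2: one long field value
            bibdata['preamble'] = bibdata.get('preamble', '') + entrystr
        elif entrytype == 'string':                  # case 4: abbreviation definition
            key, _, value = entrystr.partition('=')
            abbrevs[key.strip()] = value.strip().strip('"{}')
        else:                                        # cases 3 and 5: entrykey plus field data
            entrykey, _, fieldstr = entrystr.partition(',')
            bibdata[entrykey.strip()] = {'entrytype': entrytype}
            # ... split fieldstr into fields and parse each one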

parse_bibfield()

parse_bibfield() is the workhorse function of the BIB parsing. Because BibTeX allows concatenation, abbreviation keys, and two different types of delimiters ("..." or {...}), this function is a little messy. However, for a given field there are only four parsing possibilities:

  1. If the field begins with a double quote " then scan until you find the next unnested ". Add that to the result string. If the ending " is followed by a comma, then the field is done; return the result string. If the ending is followed by a # then expect another field string. Scan for it and append it to the current result string.
  2. If the field begins with { then scan until you resolve the brace level. This should be followed by a comma, since no concatenation is allowed for brace-delimited fields. Otherwise issue a syntax error warning.
  3. If the field begins with a # (concatenation operator) then skip whitespace to the next character set, where you should expect a quote-delimited field. Append that to the current result string.
  4. If the field begins with anything else, then the substring up until the first whitespace character represents an abbreviation key. Locate it and substitute it in. If you don’t find the key in the abbrevs dictionary, give a warning and continue on.
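
The sketch below resolves a field value built from quote-delimited pieces, the # concatenation operator, and abbreviation keys. It is far cruder than the real parse_bibfield(): for example, it breaks if a quoted string itself contains a #, and it ignores brace-delimited values entirely.

    def resolve_field(fieldstr, abbrevs):
        result = ''
        for piece in fieldstr.split('#'):
            piece = piece.strip()
            if piece.startswith('"') and piece.endswith('"'):
                result += piece[1:-1]                  # case 1: quote-delimited string
            else:
                result += abbrevs.get(piece, piece)    # case 4: abbreviation key
        return result

    # resolve_field('"J. " # optsoc # " A"', {'optsoc': 'Opt. Soc. Am.'})
    # returns 'J. Opt. Soc. Am. A'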

Parsing AUX files

The .aux file contains the filenames of the .bib database file and the .bst style template file, as well as the citations. The get_bibfilenames() method scans through the .aux file and locates a line with \bibdata{...} which contains a filename or a comma-delimited list of filenames, giving the database files. Another line with \bibstyle{...} gives the filename or comma-delimited list of filenames for style templates. The filenames obtained are saved into the filedict attribute – a dictionary whose keys are the file extensions aux, bbl, bib, bst, or tex.

The parse_auxfile() method makes a second pass through the .aux file, this time looking for the citation information. (Auxiliary files are generally quite small, so taking multiple passes through them costs very little time.) Each line with \citation{...} contains a citation key or comma-delimited list of citation keys – each one is added into the citation dictionary (citedict), with a value corresponding to the citation order.
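
A minimal sketch of these two scans is shown below; the filename is a placeholder, the regular expressions are illustrative, and the real get_bibfilenames() and parse_auxfile() methods handle more cases:

    import re

    citedict = {}
    filenames = {'bib': [], 'bst': []}
    with open('paper.aux') as auxfile:               # 'paper.aux' is a placeholder name
        for line in auxfile:
            line = line.strip()
            m = re.match(r'\\bibdata\{(.*)\}', line)
            if m:
                filenames['bib'] += [s.strip() for s in m.group(1).split(',')]
            m = re.match(r'\\bibstyle\{(.*)\}', line)
            if m:
                filenames['bst'] += [s.strip() for s in m.group(1).split(',')]
            m = re.match(r'\\citation\{(.*)\}', line)
            if m:
                for key in m.group(1).split(','):
                    citedict.setdefault(key.strip(), len(citedict) + 1)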

Parsing BST files

Parsing a .bst file basically involves looking for one of several syntactical structures.

  1. First, any # present in a line indicates a comment. All text following the # is ignored.
  2. Any line containing all capital letters and ending in : indicates a section header. The sections recognized are: TEMPLATES, SPECIAL-TEMPLATES, OPTIONS, VARIABLES, and DEFINITIONS. The first three sections (TEMPLATES, SPECIAL-TEMPLATES, and OPTIONS) use template syntax, while the last two (VARIABLES and DEFINITIONS) use Python syntax.
  3. In the TEMPLATES, SPECIAL-TEMPLATES, or OPTIONS sections of the file, any line ending in an ellipsis (...) means that the following line is a continuation. Thus, the following line is appended to the current one.
  4. For each var = definition pair found in the VARIABLES section of the file, the code creates a new entry in the user_variables dictionary, with value equal to the given definition.
  5. For each entrytype = template pair found in the TEMPLATES section of the file, the code creates a corresponding entry in bstdict, with the key given by the entrytype and value given by the template. The code next examines the template definition to see if it contains a nested options block. If so, it adds it to the list of nested templates.
  6. For each keyword = value pair found in the OPTIONS section of the file, the code creates a new entry in the options dictionary, with the dictionary key being the keyword itself, and the value copied from the right hand side of the option definition.
  7. For each var = definition pair found in the SPECIAL-TEMPLATES section of the file, the code has to do a little more work than elsewhere. First it creates a new entry in the specials dictionary, with the dictionary key given by the var, and the value given by the definition. It then appends the key to the specials_list. (Since a dictionary is not ordered, we need an order-preserving means of iterating through the list of specials to make sure that one can always be defined before another that depends on it.) Next it examines the template definition to see if it contains a nested options block. If so, it adds it to the list of nested templates. It also looks to see if there is an ellipsis representing an implicit loop. If so, it adds the template key to the list of “looped templates”. Finally it looks to see if the template’s key represents an implicitly-indexed variable. If so, it adds the key to the list of implicitly indexed variables.
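
A simplified sketch of the line-level scanning in rules 1-3 is given below; the real parser then dispatches each key = value line to the section-specific handling of rules 4-7, and the filename is a placeholder:

    section = None
    pending = ''
    with open('example.bst') as bstfile:
        for rawline in bstfile:
            line = (pending + rawline.split('#', 1)[0]).strip()   # rule 1: strip comments
            pending = ''
            if not line:
                continue
            if line.isupper() and line.endswith(':'):
                section = line[:-1]                   # rule 2: section header
                continue
            if line.endswith('...') and section in ('TEMPLATES', 'SPECIAL-TEMPLATES', 'OPTIONS'):
                pending = line[:-3] + ' '             # rule 3: next line is a continuation
                continue
            key, _, value = line.partition('=')       # a "key = value" pair for rules 4-7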

Once the initial parsing is done, there are several steps in which it analyzes the results:

  1. Iterating through each of the regular templates, the code looks to see if any of the templates are defined as copies of other templates, as, for example, inbook = incollection. If it finds this kind of definition, then it copies the template from the one (incollection here) to the other (inbook here).
  2. The code looks at the functions defined in the DEFINITIONS section of the file. If the allow_scripts keyword is set to True, then it goes ahead and evaluates these function definitions so that they will be available during the process of formatting bibliography entries.
  3. Finally, the code passes each template definition through the validate_templatestr() function to validate that the template has proper syntax.
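
A sketch of the template-aliasing step (step 1), using an invented bstdict:

    bstdict = {'incollection': '<au>, <title>, in \\textit{<booktitle>} (<year>).',
               'inbook': 'incollection'}    # inbook defined as a copy of incollection

    # Replace any template that is simply the name of another entrytype with a copy
    # of that entrytype's template.
    for entrytype, template in list(bstdict.items()):
        if template != entrytype and template in bstdict:
            bstdict[entrytype] = bstdict[template]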

Writing the BBL file

Now that all the information is available to Bibulous, we can begin writing the output BBL file. First we write a few lines to the preamble, including the preamble string obtained from the .bib database files. Then, for each citation key we found in the .aux file, we

  1. Insert any cross-reference data from any other database entries into the current one.
  2. Define all of the “special variables”, including the sortkey and citelabel, as fields within the current entry.

Now that we have all of the sortkeys, we generate the citation_list — the thing we iterate through one by one to format the references in order. At each iteration, we call format_bibitem(), which does the following:

  1. Write the line \bibitem[citelabel]{citekey} into the .bbl file.
  2. Import the template corresponding to the current entry’s entrytype.
  3. If there are any user-defined variables (from the VARIABLES section of the file), then evaluate those variables now, so that they can be used inside the template.
  4. For each option block in the template, go through and determine how to “simplify” the block. This amounts to locating the first cell in each block that has a defined value, and then replacing the [...] square-bracket-delimited block with its contents. At this point the template variables are still there; only the square brackets have been dropped (see the sketch following this list).
  5. Now that the optional pieces are all gone, go through each template variable and replace it with the corresponding field from the database entry.
  6. If there are any nested \textit{...\textit{...}...} operators in the result, replace odd-level operators with \textup{...} in order to get the right behavior of flipping between italics and regular font.
  7. If there are any nested \textbf{...\textbf{...}...} operators in the result, replace odd-level operators with \textmd{...} in order to get the right behavior of flipping between bold and regular weight.
  8. If there are any nested quotation marks in the result, then re-order them according to the American standard. This means having double-quotation-marks at the outermost level, single-quotation-marks inside that, then double inside that, single inside that, and so on. This is messy and difficult code, so users should be encouraged to use the \enquote{...} LaTeX operator instead of manually-entered quotation marks.
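
To make step 4 concrete, the sketch below simplifies a single option block. It assumes that the cells inside the [...] block are separated by |, and that a cell counts as “defined” when every template variable it contains has a value in the entry; both are simplifying assumptions rather than a description of the real code.

    import re

    def simplify_option_block(block, fields):
        for cell in block.split('|'):                 # cells of the [...] block
            variables = re.findall(r'<(\w+)>', cell)
            if all(var in fields for var in variables):
                return cell                           # the first usable cell wins
        return ''                                     # no usable cell: drop the block

    # simplify_option_block('<volume>, <startpage>--<endpage>|<eid>',
    #                       {'eid': 'e0137', 'year': '2015'})
    # returns '<eid>'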

Name formatting

One of the more complex tasks needed for parsing BIB files is to resolve the elements of name lists (typically saved in the author and editor fields). In order to know how these should be inserted into a template, it is necessary to know which parts of a given person’s name correspond to the first name, the middle name(s), the “prefix” (or “von part”), the last name (or “surname”), and the “suffix” (such as “Jr.” or “III”). These five pieces of each person’s name are saved as a dictionary, so that a bibliography entry with five authors is represented in <authorlist> as a list of five dictionaries, each dictionary having the keys first, middle, prefix, last, and suffix.

In order to speed up parsing times, the actual mapping of the author or editor fields to authorlist or editorlist is not done until the loop over citation keys performed while writing out the BBL file. The function that produces the list-of-dicts parsing result is namestr_to_namedict(namestr).
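
The resulting structure looks roughly like the following (the names are invented, and how absent name parts are represented is glossed over here):

    authorlist = [
        {'first': 'Johannes', 'prefix': 'van der', 'last': 'Waals'},
        {'first': 'Marie', 'middle': 'S.', 'last': 'Curie', 'suffix': 'Jr.'},
    ]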

The default formatting of a namelist into a string to be inserted into the template is performed by format_namelist().

create_namelist()

A BibTeX “name” field can consist of three different formats of names:

  1. A space-separated list: [firstname middlenames suffix lastname]
  2. A two-element comma-separated list: [prefix lastname, firstname middlenames]
  3. A three-element comma-separated list: [prefix lastname, suffix, firstname middlenames]

So, an easy way to separate these three categories is to count the number of commas that appear. The trickiest part is that although the word and serves as a name separator, we are only allowed to split on it when it occurs at the top brace level (see the sketch at the end of this subsection).

In addition, in order to make name parsing more flexible for nonstandard names, Bibulous adds two more name formats to this list:

  1. A four-element comma-separated list: [firstname, middlenames, prefix, lastname]
  2. A five-element comma-separated list: [firstname, middlenames, prefix, lastname, suffix]
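
For example (the name is invented, and the real code counts only commas at the top brace level):

    name = 'de la Fontaine, Jr., Jean'
    ncommas = name.count(',')    # 2 commas -> the three-element form:
                                 # [prefix lastname, suffix, firstname middlenames]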

For each name in the field, we parse the name tokens into a dictionary. We then compile all of the dictionaries into a list, ordered by the appearance of the names in the input field.
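
Splitting the field into individual names before the per-name parsing relies on the rule above that and only separates names at the top brace level. A minimal sketch (the real code also handles case variations and other details):

    def split_namefield(namestr):
        names, level, start = [], 0, 0
        i = 0
        while i < len(namestr):
            char = namestr[i]
            if char == '{':
                level += 1
            elif char == '}':
                level -= 1
            elif level == 0 and namestr[i:i+5] == ' and ':
                names.append(namestr[start:i].strip())
                start = i + 5
                i += 4                       # jump past ' and '
            i += 1
        names.append(namestr[start:].strip())
        return names

    # split_namefield('A. Author and {Paul {and} Sons Ltd.}')
    # returns ['A. Author', '{Paul {and} Sons Ltd.}']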

format_namelist()

Given a namelist (list of dictionaries), we glue the name elements together into a single string, incorporating all of the format options selected by the user in the template file. This includes calls to namedict_to_formatted_namestr(), and to initialize_name() if converting any name tokens to initials.

Generating sortkeys

If the user’s style template file selects the citation order to be citenum or none, then creating the ordered citation list is as simple as listing the citation keys in order of their citation appearance, which was recorded as the value in the citation dictionary. If the user instead chooses the citation order to be citekey, then all that is needed is to sort the citation keys alphabetically. Similar operations follow for the various citation order options, but the difficulty lies in correctly sorting in the presence of non-ASCII languages, and especially in the presence of LaTeX markup of non-ASCII names. For a citation sorting order that requires author names, any LaTeX markup needs to be converted to its Unicode equivalent prior to sorting. Using Unicode allows the sorting to be done with any input language, and allows the sorting order to be locale-dependent.

create_citation_list() is the highest-level function for generating the citation list. For each citation key, it calls generate_sortkey(), which is the workhorse function for including all of the various options when generating the key to use for sorting the list. A key part of the function is a call to purify_string(), which removes unnecessary LaTeX markup elements and then calls latex_to_utf8() to convert LaTeX-markup non-ASCII characters to Unicode. It is only after all of these conversions that the final sorting is performed and the sorted citation list returned.
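
A minimal illustration of locale-dependent sorting of Unicode sortkeys is given below; the sortkeys are invented, and the real generate_sortkey()/create_citation_list() pair does considerably more work than this:

    import locale

    locale.setlocale(locale.LC_COLLATE, '')    # collate according to the user's locale
    sortkeys = {'moller1999': 'møller', 'muller2001': 'müller', 'smith2010': 'smith'}
    citation_list = sorted(sortkeys, key=lambda key: locale.strxfrm(sortkeys[key]))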

Testing

The suite of regression tests for Bibulous consists of various template definitions and database entries designed to test individual features of the program. The basic approach of the tests is as follows:

  1. Once a change is made to the code (to fix a bug or add functionality), the developer also adds an entry to the test/test1.bib file, where the entry’s “entrytype” is named in such a way as to indicate what the test is for. For example, the entry in the BIB file may be defined with:

    @initialize1{...
    

    where the developer provides an author field in which one or more authors have names that are difficult cases for generating initials correctly. The developer should also include at least a one-line comment about the purpose of the entry. To make everything easy to find, use the entrytype as the entry’s key as well. Thus, the example above would use:

    @initialize1{initialize1, ...
    
  2. If the above new entry is something which can be checked with the normal options settings, then the developer should add a corresponding line in the BST file defining how that new entrytype (i.e. initialize1) should be formatted. If different options settings are required, then a new BST file must be created. Such a file can generally be minimal: it can, for example, contain one line defining the new entrytype and one line defining the new option setting. You can define all of the other options if you want, but these are redundant and introduce a number of unnecessary “overwriting option value…” warning messages.

  3. Next, the developer should add a line \citation{entrytype} to the AUX file, where entrytype is the key given in the new BIB file entry just added (e.g. initialize1). Keeping the citation key, entry key, and entrytype identical keeps everything consistent.

  4. Next, the developer needs to add two lines to the test1_target.bbl file to say what the formatted result should look like. Take a look at other lines to get a feel for how these should look, and take into consideration the form of the template just added to the BST file.

  5. Finally, run bibulous_test.py to check the result. This script will load the modified BIB and BST files and will write out several formatted BBL files (test1.bbl, etc.). It will then run a diff program on each output file versus its target BBL file and report any differences between the target and actual output.

Generating the documentation

The documentation is written in reStructuredText (RST) and converted to HTML using Sphinx. Sphinx can also use LaTeX to convert the HTML files into a PDF.

From the bibulous repository doc/ subfolder, run make html to generate the HTML documentation. The result can be found in doc/_build/html/, with index.html as the main file. To generate the PDF documentation, run make latexpdf from the doc/ subfolder, with the result found at doc/_build/latex/Bibulous.pdf.

While the documentation is saved in the doc/ folder on the main branch, this is not automatically converted into viewable, linked HTML on GitHub. To achieve that requires pushing the updated docs into the gh-pages branch. One way of doing this is the following. Make a local copy of the main branch’s doc/_build/html/ folder. Switch to the gh-pages branch (i.e. git checkout gh-pages) and replace everything there with the locally-copied doc/_build/html/ folder contents. Then commit and push the update (git add -A, git commit, and git push origin gh-pages). Finally, switch back to the main branch with git checkout master.

Updating the PyPI package

From the bibulous base folder, run:

python setup.py sdist --formats=gztar,zip

to create the package locally, and then run:

python setup.py sdist upload

to update the PyPI package online.

Miscellaneous notes

The code includes two different variables, citekey and entrykey, which for any given entry are always identical, so they would appear to be redundant. However, the keys in the citedict dictionary and the keys specifying each entry in the database belong to different sets. That is, the list of entry keys can include every entry in the database, even entries that were not cited, whereas the list of citation keys contains only those keys that were actually cited, and so can be a much smaller list.