Developer guide¶
Guidelines and notes for Python coding style¶
Note that one can mix 8-bit Python strings (ASCII text) with UTF-8 encoded text as long as the 8-bit string contains only ASCII characters.
Keep in mind when running into Unicode errors: reading a line of text from a file produces a line of bytes and not characters. To decode the bytes into a string of characters, you need to know the encoding.
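The bytes-versus-characters distinction can be illustrated with a minimal sketch. It assumes Python 3 and a UTF-8 encoded file; the sample line and the variable names are made up for the illustration and are not taken from the Bibulous code.

```python
import io

# Hypothetical sample: one line of a .bib file, stored as raw UTF-8 bytes.
raw = "author = {Gödel, Kurt}\n".encode("utf-8")

# Reading in binary mode yields a line of bytes, not characters.
line_bytes = io.BytesIO(raw).readline()

# Decoding requires knowing the encoding; UTF-8 is assumed here.
line_text = line_bytes.decode("utf-8")

# Text that contains only ASCII characters decodes identically either way,
# which is why it can be mixed safely with UTF-8 encoded text.
ascii_bytes = b"year = 1931"
same_text = ascii_bytes.decode("ascii") == ascii_bytes.decode("utf-8")
```

Note that the byte string is one element longer than the decoded text, because the non-ASCII character occupies two bytes in UTF-8.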
There are a couple of minor points where the Bibulous coding standards deviate from Python’s PEP 8:
- A line width of 120 is the standard (not 80).
- In general, statements that evaluate to a boolean are placed within parentheses (i.e. if (a < b): rather than if a < b:).
Many developers prefer to spread code across a large number of small files, but Bibulous is currently organized as a single large file. This is partly because no block of code is large and self-contained enough to justify a separate file. (Parsing of
.bib
files, for example, only requires a couple hundred lines.)
Overall project strategy and code structure¶
The basic function of BibTeX is to accept an .aux
file as input and to produce a .bbl
file as output. The aux
file contains all of the citation information as well as the filenames for the bibliography database file (.bib
) and the style file (.bst
).
The basic program flow is as follows:
- Read the .aux file and get the names of the bibliography databases (.bib files) and the style templates (.bst files) to use, together with the entire set of citations.
- Read in the Bibulous style template file as a dictionary (bstdict).
- If the use_citeextract keyword is set to True, and if an “extracted” database file exists, then compare the citations in the extracted database against those in the .aux file. If there are any differences, then re-extract the database. Otherwise, use the extracted database rather than the full one specified in the .aux file.
- Read all of the bibliography database files into one long dictionary (bibdata), replacing any abbreviations with their full form. In an “extracted” database, all entries are parsed, whereas in any other type of database file, only those entries whose keys are found in the citation list are actually parsed. All other entries have their data saved as unparsed strings. Cross-referenced data is not yet inserted at this point. That is delayed until the time of writing the BBL file in order to speed up parsing. It is only then that the cross-referenced entries have their data parsed into dictionary form.
- Now that all the information is collected, we can generate the .bbl file. Create the list of sortkeys, then go through each corresponding citation key in turn, and find the corresponding entry key in bibdata. If there is crossref data, then fill in missing values here. Also create the “special variables” here. Finally, from the entry type, select a template from bstdict and begin inserting the variables one-by-one into the template.
Because the .bib
file is highly structured, it is straightforward to write a parser by hand in Python: the parse_bibfile()
method converts the .bib
file contents into a Python dictionary (the Bibdata
class’ bibdata
). The .aux
file is even easier to parse, and the parse_auxfile()
method converts the citation information into the Bibdata
class’ citedict
dictionary.
The Bibdata
class thus holds all relevant information needed to operate on a bibliography and generate the output LaTeX-formatted .bbl
file.
Parsing BIB files¶
parse_bibfile()¶
The strategy for parse_bibfile()
is to find each individual bibliography entry, determine its entry type, and save all of the text between the entry’s opening and closing braces as one long string, to be passed to parse_bibentry()
for further parsing. To gather the entry data string, we first look for a line that starts with @
. On that line, we look for a string after the @
followed by {
, where the string gives the entry type. After we know the entry type, we look for the corresponding closing brace. If we don’t find it on the same line, then we read in the next line, and so forth, concatenating all of the lines into one long “entry string” until we encounter the corresponding closing brace. Once we have this extended “entry string” we feed it to parse_bibentry()
to generate the bibliography data. Once we have come to the end of a given entry, we continue reading down the file looking for the next ‘@’ and so on.
Although this approach effectively means that we have to pass through the same data twice, brace-matching can otherwise become a mess: because the BibTeX format allows nested delimiters, it is not directly compatible with regular expressions.
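The entry-gathering strategy described above can be sketched as follows. This is a simplified illustration, not the actual parse_bibfile() code: the function name gather_entries is hypothetical, escaped braces are not handled, and any text following an entry's closing brace on the same line is discarded.

```python
def gather_entries(lines):
    """Return (entrytype, entry string) pairs found in an iterable of text lines."""
    entries = []
    entrytype = None   # None means we are currently between entries
    depth = 0
    chunks = []
    for line in lines:
        if entrytype is None:
            stripped = line.lstrip()
            if not stripped.startswith("@") or "{" not in stripped:
                continue
            # The text between "@" and the first "{" gives the entry type.
            head, line = stripped[1:].split("{", 1)
            entrytype = head.strip().lower()
            depth = 1
            chunks = []
        # Count braces until the entry's opening brace has been matched.
        for i, ch in enumerate(line):
            if ch == "{":
                depth += 1
            elif ch == "}":
                depth -= 1
                if depth == 0:
                    chunks.append(line[:i])
                    entries.append((entrytype, "".join(chunks)))
                    entrytype = None
                    break
        else:
            # No closing brace yet: keep the whole line and read the next one.
            chunks.append(line)
    return entries
```

Each returned entry string is what would then be handed to the per-entry parser.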
parse_bibentry()¶
Because parse_bibfile()
has already split the data by individual entry, parse_bibentry()
only needs to worry about parsing a single entry, and there are five possible formats for the entry string passed to the function:
- If the entrytype is a comment, then skip everything, adding nothing to the database dictionary.
- If the entrytype is a preamble, then treat the entire entry contents as a single fieldvalue. Append the string onto the preamble value in the bibdata dictionary.
- If the entrytype is an acronym, then get the entrykey and copy it into the name field. The remainder of the string is a single field value (the full form of the acronym); copy that into the description field.
- If the entrytype is a string (i.e. an abbreviation), then there is no entrykey. Get the fieldname (abbreviation key), and the remainder of the string is a single field value (the full form of the abbreviated string). Add this key-value pair to the abbrevs dictionary.
- If the entry is any other type, then get the entrykey, and the remainder of the string is a series of field-value pairs.
Once it determines which of these five options to use, parse_bibentry()
extracts the entry key (if present), then locates each individual field and separates out the string corresponding to the key-value pair for each field. It does not actually parse the individual fields. For that, it loops over each field with a call to parse_bibfield()
to extract the field key-value pairs.
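For the last of the cases listed above (an ordinary entry with an entry key followed by field-value pairs), separating the raw fields amounts to splitting the entry string at commas that lie outside any braces or quotes. The sketch below shows one way to do this; split_entry is a hypothetical helper written for this illustration, not the Bibulous implementation.

```python
def split_entry(entrystr):
    """Split an entry string into (entrykey, list of raw 'name = value' strings)."""
    pieces, current = [], []
    depth, in_quotes = 0, False
    for ch in entrystr:
        if ch == '"' and depth == 0:
            in_quotes = not in_quotes      # quotes only count outside braces
        elif ch == "{":
            depth += 1
        elif ch == "}":
            depth -= 1
        if ch == "," and depth == 0 and not in_quotes:
            # A top-level comma ends the current piece.
            pieces.append("".join(current).strip())
            current = []
        else:
            current.append(ch)
    pieces.append("".join(current).strip())
    pieces = [p for p in pieces if p]      # drop empties, e.g. after a trailing comma
    return pieces[0], pieces[1:]
```

The raw field strings are left unparsed here, mirroring the division of labor described above.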
parse_bibfield()¶
parse_bibfield()
is the workhorse function of the BIB parsing. And because of BibTeX’s method for allowing concatenation, use of abbreviation keys, and use of two different types of delimiters ("..."
or {...}
), this function is a little messy. However, for the format of a given field, there are four parsing possibilities:
- If the field begins with a double quote " then scan until you find the next unnested ". Add that to the result string. If the ending " is followed by a comma, then the field is done; return the result string. If the ending is followed by a # then expect another field string. Scan for it and append it to the current result string.
- If the field begins with { then scan until you resolve the brace level. This should be followed by a comma, since no concatenation is allowed for brace-delimited fields. Otherwise issue a syntax error warning.
- If the field begins with a # (concatenation operator) then skip whitespace to the next character set, where you should expect a quote-delimited field. Append that to the current result string.
- If the field begins with anything else, then the substring up until the first whitespace character represents an abbreviation key. Locate it and substitute it in. If you don’t find the key in the abbrevs dictionary, give a warning and continue on.
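The four cases above can be sketched in a single scanning loop. This is an illustration under stated assumptions, not the real parse_bibfield(): the name parse_value is hypothetical, the input is assumed to be one field's value with the trailing comma already removed, no syntax warnings are issued, and an unknown abbreviation key (or a bare number) is simply kept as-is.

```python
def parse_value(valuestr, abbrevs):
    """Resolve one raw field value into its final string."""
    brace_step = {"{": 1, "}": -1}
    result = []
    i, n = 0, len(valuestr)
    while i < n:
        ch = valuestr[i]
        if ch.isspace() or ch == "#":
            i += 1                          # skip whitespace and concatenation operators
        elif ch == '"':
            # Scan to the next double quote that is not nested inside braces.
            depth, j = 0, i + 1
            while j < n and (valuestr[j] != '"' or depth > 0):
                depth += brace_step.get(valuestr[j], 0)
                j += 1
            result.append(valuestr[i + 1:j])
            i = j + 1
        elif ch == "{":
            # Scan until the brace level returns to zero.
            depth, j = 1, i + 1
            while j < n and depth > 0:
                depth += brace_step.get(valuestr[j], 0)
                j += 1
            result.append(valuestr[i + 1:j - 1])
            i = j
        else:
            # Anything else is treated as an abbreviation key.
            j = i
            while j < n and not valuestr[j].isspace() and valuestr[j] != "#":
                j += 1
            key = valuestr[i:j]
            result.append(abbrevs.get(key, key))   # assumption: keep unknown keys as-is
            i = j
    return "".join(result)
```

For example, with abbrevs = {"jan": "January"}, the value jan # " 1" resolves to "January 1".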
Parsing AUX files¶
The .aux
file contains the filenames of the .bib
database file and the .bst
style template file, as well as the citations. The get_bibfilenames()
method scans through the .aux
file and locates a line with \bibdata{...}
which contains a filename or a comma-delimited list of filenames, giving the database files. Another line with \bibstyle{...}
gives the filename or comma-delimited list of filenames for style templates. The filenames obtained are saved into the filedict
attribute – a dictionary whose keys are the file extensions aux
, bbl
, bib
, bst
, or tex
.
The parse_auxfile()
method makes a second pass through the .aux
file, this time looking for the citation information. (Auxiliary files are generally quite small, so taking multiple passes through them costs very little time.) Each line with \citation{...}
contains a citation key or comma-delimited list of citation keys – each one is added into the citation dictionary (citedict
), with a value corresponding to the citation order.
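As a rough sketch of the scanning described above, the three relevant commands can be picked out with one regular expression. Unlike the two-pass approach of get_bibfilenames() and parse_auxfile(), this illustration makes a single pass; the function name scan_aux and the choice to number citations starting from 1 are assumptions made for the example.

```python
import re

# Matches \bibdata{...}, \bibstyle{...} and \citation{...} at the start of a line.
AUX_COMMAND = re.compile(r"\\(bibdata|bibstyle|citation)\{([^}]*)\}")

def scan_aux(lines):
    """Return (bibfiles, bstfiles, citedict) from the lines of an .aux file."""
    bibfiles, bstfiles, citedict = [], [], {}
    for line in lines:
        match = AUX_COMMAND.match(line.strip())
        if not match:
            continue
        command, body = match.groups()
        # Each command may carry a single name or a comma-delimited list.
        names = [name.strip() for name in body.split(",") if name.strip()]
        if command == "bibdata":
            bibfiles.extend(names)
        elif command == "bibstyle":
            bstfiles.extend(names)
        else:
            for key in names:
                # Record the order of first appearance as the citation number.
                citedict.setdefault(key, len(citedict) + 1)
    return bibfiles, bstfiles, citedict
```

A key cited more than once keeps the number assigned at its first appearance.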
Parsing BST files¶
Parsing a .bst
file basically involves looking for one of several syntactical structures.
- First, any # present in a line indicates a comment. All text following the # is ignored.
- Any line containing all capital letters and ending in : indicates a section header. The sections recognized are: TEMPLATES, SPECIAL-TEMPLATES, OPTIONS, VARIABLES, and DEFINITIONS. The first three sections (TEMPLATES, SPECIAL-TEMPLATES, and OPTIONS) use template syntax, while the last two (VARIABLES and DEFINITIONS) use Python syntax.
- In the TEMPLATES, SPECIAL-TEMPLATES, or OPTIONS sections of the file, any line ending in an ellipsis (...) means that the following line is a continuation. Thus, the following line is appended to the current one.
- For each var = definition pair found in the VARIABLES section of the file, the code creates a new entry in the user_variables dictionary, with value equal to the given definition.
- For each entrytype = template pair found in the TEMPLATES section of the file, the code creates a corresponding entry in bstdict, with the key given by the entrytype and value given by the template. The code next examines the template definition to see if it contains a nested options block. If so, it adds it to the list of nested templates.
- For each keyword = value pair found in the OPTIONS section of the file, the code creates a new entry in the options dictionary, with the dictionary key being the keyword itself, and the value copied from the right hand side of the option definition.
- For each var = definition pair found in the SPECIAL-TEMPLATES section of the file, the code has to do a little more work than elsewhere. First it creates a new entry in the specials dictionary, with the dictionary key given by the var, and the value given by the definition. It then appends the key to the specials_list. (Since a dictionary is not ordered, we need an order-preserving means of iterating through the list of specials to make sure that one can always be defined before another that depends on it.) Next it examines the template definition to see if it contains a nested options block. If so, it adds it to the list of nested templates. It also looks to see if there is an ellipsis representing an implicit loop. If so, it adds the template key to the list of “looped templates”. Finally it looks to see if the template’s key represents an implicitly-indexed variable. If so, it adds the key to the list of implicitly indexed variables.
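The first three rules in the list above (comments, section headers, and continuation lines) can be sketched as a small generator. This is only an illustration: read_bst_lines is a hypothetical name, and the per-section handling of each key = value pair is left out.

```python
import re

# A section header: capital letters (and hyphens) followed by a colon.
SECTION = re.compile(r"^([A-Z-]+):$")
TEMPLATE_SECTIONS = ("TEMPLATES", "SPECIAL-TEMPLATES", "OPTIONS")

def read_bst_lines(lines):
    """Yield (section, logical line) pairs from the lines of a style file."""
    section, pending = None, ""
    for raw in lines:
        line = raw.split("#", 1)[0].strip()    # everything after "#" is a comment
        if not line:
            continue
        header = SECTION.match(line)
        if header:
            section = header.group(1)
            continue
        if section in TEMPLATE_SECTIONS and line.endswith("..."):
            pending += line[:-3]               # continuation: join with the next line
            continue
        yield section, pending + line
        pending = ""
```

Each yielded pair could then be split once at the first = sign and stored in the dictionary belonging to its section.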
Once the initial parsing is done, the code analyzes the results in several steps:
- Iterating through each of the regular templates, the code looks to see if any of the templates are defined as copies of other templates, as, for example, inbook = incollection. If it finds this kind of definition, then it copies the template from the one (incollection here) to the other (inbook here).
- The code looks at the functions defined in the DEFINITIONS section of the file. If the allow_scripts keyword is set to True, then it goes ahead and evaluates these function definitions so that they will be available during the process of formatting bibliography entries.
- Finally, the code passes each template definition through the validate_templatestr() function to validate that the template has proper syntax.
Writing the BBL file¶
Now that all the information is available to Bibulous, we can begin writing the output BBL file. First we write a few lines to the preamble, including the preamble
string obtained from the .bib
database files. Then, for each citation key we found in the .aux
file, we
- Insert any cross-reference data from any other database entries into the current one.
- Define all of the “special variables”, including the sortkey and citelabel, as fields within the current entry.
Now that we have all of the sortkeys, we generate the citation_list
— the thing we iterate through one by one to format the references in order. At each iteration, we call format_bibitem()
, which does the following:
- Write the line \bibitem[citelabel]{citekey} into the .bbl file.
- Import the template corresponding to the current entry’s entrytype.
- If there are any user-defined variables (from the VARIABLES section of the file), then evaluate those variables now, so that they can be used inside the template.
- For each option block in the template, go through and determine how to “simplify” the block. This amounts to locating the first cell in each block that has a defined value, and then replacing the [...] square-bracket-delimited block with its contents. At this point the template variables are still there; only the square brackets have been dropped.
- Now that the optional pieces are all gone, go through each template variable and replace it with the corresponding field from the database entry.
- If there are any nested \textit{...\textit{...}...} operators in the result, replace odd-level operators with \textup{...} in order to get the right behavior of flipping between italics and regular font.
- If there are any nested \textbf{...\textbf{...}...} operators in the result, replace odd-level operators with \textup{...} in order to get the right behavior of flipping between bold and regular weight.
- If there are any nested quotation marks in the result, then re-order them according to the American standard. This means having double-quotation-marks at the outermost level, single-quotation-marks inside that, then double inside that, single inside that, and so on. This is messy and difficult code, so users are encouraged to use the \enquote{...} LaTeX operator instead of manually-implemented quotation marks.
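The flipping of nested font operators described in the list above can be sketched with a brace stack. This is a minimal illustration rather than the Bibulous code: flip_nested and its parameters are hypothetical names, and the outermost operator is counted as level zero, so that the first nested one is the "odd-level" operator that gets replaced.

```python
def flip_nested(text, command="\\textit", alternate="\\textup"):
    """Replace odd-level nested occurrences of command with alternate."""
    opener = command + "{"
    out = []
    stack = []      # one flag per open brace: True if that brace belongs to command
    i = 0
    while i < len(text):
        if text.startswith(opener, i):
            level = sum(stack)             # number of enclosing command groups
            out.append((alternate if level % 2 else command) + "{")
            stack.append(True)
            i += len(opener)
        else:
            ch = text[i]
            if ch == "{":
                stack.append(False)        # an ordinary brace group
            elif ch == "}" and stack:
                stack.pop()
            out.append(ch)
            i += 1
    return "".join(out)
```

Ordinary brace groups are tracked but do not change the nesting level, so only operators that are actually nested inside one another are affected.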
Name formatting¶
One of the more complex tasks needed for parsing BIB files is to resolve the elements of name lists (typically saved in the author
and editor
fields). In order to know how these should be inserted into a template, it is necessary to know which parts of a given person’s name correspond to the first name, the middle name(s), the “prefix” (or “von part”), the last name (or “surname”), and the “suffix” (such as “Jr.” or “III”). These five pieces of each person’s name are saved as a dictionary, so that a bibliography entry with five authors is represented in <authorlist>
as a list of five dictionaries, and each dictionary having keys first
, middle
, prefix
, last
, and suffix
.
In order to speed up parsing times, the actual mapping of the author
or editor
fields to authorlist
or editorlist
is not done until the loop over citation keys performed while writing out the BBL file. The function that produces the list-of-dicts parsing result is namestr_to_namedict(namestr)
.
The default formatting of a namelist into a string to be inserted into the template is performed by format_namelist()
.
create_namelist()¶
A BibTeX “name” field can consist of three different formats of names:
- A space-separated list:
[firstname middlenames prefix lastname]
- A two-element comma-separated list:
[prefix lastname, firstname middlenames]
- A three-element comma-separated list:
[prefix lastname, suffix, firstname middlenames]
So, an easy way to separate these three categories is by counting the number of commas that appear. The trickiest part here is that although we can use and
as a name separator, we are only allowed to do so if and
occurs at the top brace level.
In addition, in order to make name parsing more flexible for nonstandard names, Bibulous adds two more name formats to this list:
- A four-element comma-separated list:
[firstname, middlenames, prefix, lastname]
- A five-element comma-separated list:
[firstname, middlenames, prefix, lastname, suffix]
For each name in the field, we parse the name tokens into a dictionary. We then compile all of the dictionaries into a list, ordered by the appearance of the names in the input field.
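The two structural ideas in this section, splitting on and only at the top brace level and classifying a name by its comma count, can be sketched as follows. Both function names are hypothetical and the sketch stops short of filling in the name dictionaries; it is not the create_namelist() implementation.

```python
def split_top_level(text, separator):
    """Split text at separator, ignoring separators nested inside braces."""
    parts, depth, start, i = [], 0, 0, 0
    while i < len(text):
        if text[i] == "{":
            depth += 1
        elif text[i] == "}":
            depth -= 1
        elif depth == 0 and text.startswith(separator, i):
            parts.append(text[start:i].strip())
            i += len(separator)
            start = i
            continue
        i += 1
    parts.append(text[start:].strip())
    return parts


def name_format(namestr):
    """Classify one name by its number of top-level commas."""
    ncommas = len(split_top_level(namestr, ",")) - 1
    formats = {0: "space-separated", 1: "two-element", 2: "three-element",
               3: "four-element", 4: "five-element"}
    return formats.get(ncommas, "unrecognized")
```

A name field would first be split with split_top_level(field, " and "), after which each resulting name is classified on its own.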
format_namelist()¶
Given a namelist (list of dictionaries), we glue the name elements together into a single string, incorporating all of the format options selected by the user in the template file. This includes calls to namedict_to_formatted_namestr()
, and to initialize_name()
if converting any name tokens to initials.
Generating sortkeys¶
If the user’s style template file selects the citation order to be citenum
or none
, then creating the ordered citation list is as simple as listing the citation keys in order of their citation appearance, which was recorded as the value in the citation dictionary. If the user instead chooses the citation order to be citekey
, then all that is needed is to sort the citation keys alphabetically. Similar operations follow for the various citation order options, but the difficult lies in correctly sorting in the presence of non-ASCII languages, and especially in the presence of LaTeX markup of non-ASCII names. For a citation sorting order that requires using author names, any LaTeX markup needs to be converted to its Unicode equivalent prior to sorting. Using unicode allows the sorting to be done with any input languages, and allows the sorting order to be locale-dependent.
create_citation_list()
is the highest-level function for generating the citation list. For each citation key, it calls generate_sortkey()
, which is the workhorse function for including all of the various options when generating the key to use for sorting the list. A key part of the function is a call to purify_string()
, which removes unnecessary LaTeX markup elements and then calls latex_to_utf8()
to convert LaTeX-markup non-ASCII characters to Unicode. It is only after all of these conversions that the final sorting is performed and the sorted citation list returned.
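The simplest orderings can be sketched directly from the citation dictionary. In the illustration below, ordered_citations and simple_sortkey are hypothetical names; simple_sortkey is only a crude stand-in that strips braces and folds case and accents, and it does not reproduce what purify_string() and latex_to_utf8() do, nor any locale-dependent collation.

```python
import unicodedata

def simple_sortkey(text):
    """Build a rough sort key: strip braces, fold case, drop accents."""
    text = text.replace("{", "").replace("}", "")
    decomposed = unicodedata.normalize("NFKD", text.casefold())
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))

def ordered_citations(citedict, order="citenum", sortkeys=None):
    """Return the citation keys in the requested order."""
    if order in ("citenum", "none"):
        # The dictionary values record the order of citation appearance.
        return sorted(citedict, key=citedict.get)
    if order == "citekey":
        return sorted(citedict)
    # Any other order: sort by a per-entry string supplied by the caller.
    return sorted(citedict, key=lambda key: simple_sortkey(sortkeys[key]))
```

For orders other than citenum, none, and citekey, this sketch expects the caller to pass a sortkeys dictionary mapping each citation key to the text it should be sorted by.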
Testing¶
The suite of regression tests for Bibulous consists of various template definitions and database entries designed to test individual features of the program. The basic approach of the tests is as follows:
- Once a change is made to the code (to fix a bug or add functionality), the developer also adds an entry to the test/test1.bib file, where the entry’s “entrytype” is named in such a way as to give an indication of what the test is for. For example, the entry in the BIB file may be defined with:
@initialize1{...
where the developer provides an author field in the entry in which one or more authors have names for which it is difficult to generate initials correctly. The developer should also include at least a one-line comment about the purpose of the entry. To make everything easy to find, use the entrytype as the entry’s key as well. Thus, the example above would use:
@initialize1{initialize1, ...
- If the above new entry is something which can be checked with normal options settings, then the developer should add a corresponding line in the BST file defining how that new entrytype (i.e. initialize1) should be formatted. If different options settings are needed, then a new BST file is needed. Only a minimalist file is generally needed: the file can, for example, contain one line defining a new entrytype and one line to define the new option setting. You can define all of the other options if you want, but these are redundant and introduce a number of unnecessary “overwriting option value…” warning messages.
- Next, the developer should add a line \citation{entrytype} to the AUX file, where the entrytype is the key given in the new entry of the BIB file you just put in (e.g. initialize1). This is the same as the entrytype to keep everything consistent.
- Next, the developer needs to add two lines to the test1_target.bbl file to say what the formatted result should look like. Take a look at other lines to get a feel for how these should look, and take into consideration the form of the template just added to the BST file.
- Finally, run bibulous_test.py to check the result. This script will load the modified BIB and BST files and will write out several formatted BBL files (test1.bbl, etc.). It will then run a diff program on the output file versus the target BBL file to see if there are any differences between the target and actual output BBL files.
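The final comparison step can be sketched with the standard library's difflib module. This is an assumption about how such a check could be written, not a description of what bibulous_test.py actually does; bbl_differences is a hypothetical helper that takes the two files' contents as strings.

```python
import difflib

def bbl_differences(actual_text, target_text):
    """Return unified-diff lines between the actual and target BBL contents."""
    return list(difflib.unified_diff(
        target_text.splitlines(),
        actual_text.splitlines(),
        fromfile="test1_target.bbl",
        tofile="test1.bbl",
        lineterm="",
    ))
```

An empty list means the generated output matches the target, so a test can pass or fail on whether the returned list is empty.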
Generating the documentation¶
The documentation is written in reStructuredText (RST) and converted to HTML using Sphinx. Sphinx can also generate LaTeX output from the same sources, which is then compiled into a PDF.
From the bibulous repository doc/
subfolder, run make html
to generate the HTML documentation. The result can be found in doc/_build/html/
, with index.html
as the main file. To generate the PDF documentation, run make latexpdf
from the doc/
subfolder, with the result found at doc/_build/latex/Bibulous.pdf
.
While the documentation is saved in the doc/
folder on the main branch, this is not automatically converted into viewable, linked HTML on GitHub. To achieve that requires pushing the updated docs into the gh-pages
branch. One way of doing this is the following. Make a local copy of the main branch’s doc/_build/html/
folder. Switch to the gh-pages
branch (i.e. git checkout gh-pages
) and replace everything there with the locally-copied doc/_build/html/
folder contents. Then commit and push the update: git add -A
, git commit
, and git push origin gh-pages
. Finally, switch back to the main branch: git checkout master
.
Updating the PyPI package¶
From the bibulous base folder, run:
python setup.py sdist --formats=gztar,zip
to create the package locally, and then run:
python setup.py sdist upload
to update the PyPI package online.
Miscellaneous notes¶
The code includes two different variables, citekey
and entrykey
which for any given entry are always identical. So it would appear that they are redundant. But the keys in the citedict
dictionary, and the keys specifying each entry in the database, belong to different sets. That is, the list of entry keys can be from every entry in the database, even entries that were not cited. The list of citation keys, however, contains only those keys that were cited, and so can be a much smaller list.