Sorting bibliographies with special characters #320

kchalipa · 2018-02-01T04:52:57Z

Hi all,

I've run into a challenge with pandoc-citeproc regarding the way it sorts special characters. It seems that any special character (even common glyphs like à) or diacritic is sorted after the letter Z, which can cause major sorting problems in large bibliographies. I posted an example of this on StackExchange; basically what you'll see is that a name like Ábe will be sorted after a name like Aze. This strikes me as an issue that would affect many writers who rely on special characters or diacritics in their work--is there any way to resolve this?

As a further note, I will point out that different authors may have different sorting criteria that are hard to cover in a one-size-fits-all approach. For example, I might want to order my bibliography "San, Šen, Son, Šun", while another author would want to treat the Š as its own letter (San, Son, Šen, Šun). In addition, there are disciplines that use extra characters like ʿ or ʾ that are typically ignored altogether in alphabetical sorting, but are currently sent to the end of the alphabet. Given how idiosyncratic these scenarios can get, the best solution that I can think of is to include some kind of sort key that lets me tell the parser where to put the entry; so if I have a citation like "al-ʿUdhrī" I could tell it to be sorted not under "a" nor under "z" but under "u", as though it were "Udhri". Such solutions have been devised for BibTeX; is there, or could there be, something similar with pandoc-citeproc?

Sorry if this has already been answered elsewhere and I failed to find it.

jgm · 2018-02-01T18:29:06Z

I'm not sure what to do about this. There are different ways to sort, as you note, and it's hard to see how to incorporate an option for this.

You could try compiling pandoc-citeproc with the unicode_collation flag to see if that gives you different results. (requires the icu library)

kchalipa · 2018-02-03T02:32:31Z

Thanks! I'll look into it and report back about what I find.

I see what you mean about there being no easy option; looking through the many fields available in the CSL editor, I didn't see anything to match the sortname or sortkey fields provided by biblatex (although I'm no expert on CSL, or any of this really, so there could well be something I don't know about). One partial workaround—if converting to html, rtf, docx, or the like—is simply to sort the bibliography with another program in the final stage of preparation, although that does require turning off the em-dashes below repeated authors (not sure how to do that) and could introduce new problems with non-dropping particles; but it's a start.

njbart · 2018-02-03T08:27:34Z

Using ICU is almost certainly the way to go here. This used to work well a few years back, though I stopped including it when compiling pandoc-citeproc when I switched from cabal to stack.

ICU can do language (or rather, locale) sensitive collation. (For example, if set to es-ES (Spanish) it collates ñ as a separate letter after n.)

As to “extra characters”, these don’t seem to be displayed properly in the post above, so I can’t say anything on these.

Collating “al-ʿUdhrī” under “U” merely requires setting demote-non-dropping-particle to sort-only in the CSL style file. (Edit: Depending on how the “ʿ” is dealt with, though.)

And yes, there is nothing like the biblatex sortname or sortkey fields in CSL. (In a pinch, you could borrow an otherwise unused CSL variable, and use it as a sortname field, but you’d probably have to populate it for all entries, not just, as in biblatex, the problematic ones.)

@jgm – How exactly would I compile pandoc-citeproc with the unicode_collation flag, using stack?

kchalipa · 2018-02-03T15:31:49Z

Thanks for the further comments! Just to clarify my extra characters, I posted two (small) glyphs that are used in Arabic transliteration, ayn (ʿ) and hamza (ʾ) --- U02BF and U02BE respectively. Since they fall outside the normal set of Latin characters, they are typically ignored in sorting/parsing; some programs do this, but the vast majority (including citeproc, from what I've seen) sort them at the end of the alphabet. Because of the ayn, the example we're discussing, al-ʿUdhrī, would get sorted after Z, which may just be how the cookie crumbles, unless we used the unused-variable trick or compiling with ICU works out. I'll note that when I used biblatex for my bibliographies, it did handle these characters very well (by ignoring them), so maybe we can be optimistic!

jgm · 2018-02-03T19:26:46Z

+++ Nick Bart [Feb 03 18 08:27 ]:

> @jgm – How exactly would I compile pandoc-citeproc with the unicode_collation flag, using stack?

stack install pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"

njbart · 2018-02-04T10:07:42Z

I’m a little confused now – I thought we had this issue sorted out in #122.

I am currently installing the dev version using the following script (based on the instructions from the pandoc wiki; I’ve been using this script for a while, it’s only pandoc-citeproc that gets a special treatment now – comments, e.g., on whether this is the optimal sequence, are welcome):

buildhome=/Users/nick/src/pandoc-build

for package in cmark-hs pandoc-types texmath zip-archive pandoc
do
  printf "\n%s ...\n" "$package"
  cd "$buildhome/$package"
  git pull
  stack install
done

for package in pandoc-citeproc
do
  printf "\n%s ...\n" "$package"
  cd "$buildhome/$package"
  git pull
  stack install --flag "pandoc:unicode_collation"
done

Now, the following script, which tries both the latest dev version (first in my path) and the homebrew version (at /usr/local/bin), for the two locales en-US and da-DK …

cat > test.yaml << EOT
---
nocite: '@*'
references:
- id: item1
  type: book
  author:
  - family: Author  
- id: item2
  type: book
  author:
  - family: Zuthor
- id: item3
  type: book
  author:
  - family: Øøthor
...
EOT

printf "\nlatest dev -- en-US:\n\n"
printf "\n---\nnocite: '@*'\n..." | pandoc -s -F pandoc-citeproc -t plain --biblio test.yaml -M lang='en-US'
printf "\nlatest dev -- da-DK:\n\n"
printf "\n---\nnocite: '@*'\n..." | pandoc -s -F pandoc-citeproc -t plain --biblio test.yaml -M lang='da-DK'

printf "\nlatest homebrew -- en-US:\n\n"
printf "\n---\nnocite: '@*'\n..." | /usr/local/bin/pandoc -s -F /usr/local/bin/pandoc-citeproc -t plain --biblio test.yaml -M lang='en-US'
printf "\nlatest homebrew -- da-DK:\n\n"
printf "\n---\nnocite: '@*'\n..." | /usr/local/bin/pandoc -s -F /usr/local/bin/pandoc-citeproc -t plain --biblio test.yaml -M lang='da-DK'

outputs this (some newlines removed):

latest dev -- en-US:
Author. n.d.
Zuthor. n.d.
Øøthor. n.d.

latest dev -- da-DK:
Author. u.å.
Zuthor. u.å.
Øøthor. u.å.

latest homebrew -- en-US:
Author. n.d.
Zuthor. n.d.
Øøthor. n.d.

latest homebrew -- da-DK:
Author. u.å.
Zuthor. u.å.
Øøthor. u.å.

… so no difference at all. For da-DK, this collation seems correct (Ø after Z), but for en-US, it is not (CMoS 16e, 16.67, “Alphabetizing accented letters” quite clearly calls for disregarding any accents, so we should have A – Ø – Z here).

lang seems to be parsed correctly, as judged by the different abbreviations for “no date”.

Just in case, I’m using macOS 10.13.3, and this is the output I get from locale (possibly relevant, again, see #122):

LANG="en_GB.UTF-8"
LC_COLLATE="en_GB.UTF-8"
LC_CTYPE="en_GB.UTF-8"
LC_MESSAGES="en_GB.UTF-8"
LC_MONETARY="en_GB.UTF-8"
LC_NUMERIC="en_GB.UTF-8"
LC_TIME="en_GB.UTF-8"
LC_ALL=

Any ideas?

jgm · 2018-02-06T00:46:51Z

I'm sorry, the wiki needed updating. (See current version.) What you really want to do is just 1) clone the pandoc repository and cd into it, 2) stack setup, 3) stack install pandoc pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"

njbart · 2018-02-06T08:10:35Z

Ok, this seems to have worked. (Should we add the complete command I had to use, stack install pandoc pandoc-citeproc --flag "pandoc-citeproc:unicode_collation" --extra-lib-dirs=/usr/local/opt/icu4c/lib --extra-include-dirs=/usr/local/opt/icu4c/include, to the wiki?)

Now I see a new issue: It seems correct collation of accented characters in English is only available if ICU is used, so the question is, could it be included by default?

EDIT: Removed a report on sorting of “Aa” – it seems in Danish, “Aa” should be sorted just like “Å”.
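
For anyone whose ICU lives elsewhere, the complete command quoted above can be assembled for an arbitrary prefix. This is only a sketch: the helper name icu_install_cmd is made up for illustration, and /usr/local/opt/icu4c is simply the Homebrew location reported in this comment, so treat it as an assumption on any other system. The function only prints the command; it does not run stack.

```shell
#!/bin/sh
# Hypothetical helper: print the stack command for a given ICU prefix.
# Nothing is executed here; run the printed line inside the cloned
# pandoc repository after `stack setup`.
icu_install_cmd() {
  prefix=$1
  printf 'stack install pandoc pandoc-citeproc'
  printf ' --flag "pandoc-citeproc:unicode_collation"'
  printf ' --extra-lib-dirs=%s/lib' "$prefix"
  printf ' --extra-include-dirs=%s/include\n' "$prefix"
}

# Homebrew location reported in this thread (an assumption elsewhere).
icu_install_cmd /usr/local/opt/icu4c
```

On a Homebrew system the prefix can probably be obtained with brew --prefix icu4c rather than hard-coded.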

njbart · 2018-02-06T08:24:28Z

As to special characters such as ayn and hamza, it seems these are currently sorted after “Z” in en-US and a few other locales I tried, even when using ICU. (Other special characters, e.g., a straight apostrophe at the start of a name, do not influence sorting, so ignoring specific characters certainly does seem to be possible.)

ICU does seem to provide a mechanism for customizing collation, so there might be a possibility to make this user-configurable, e.g., via a pandoc-citeproc flag.

njbart · 2018-02-06T11:12:00Z

More data points:

Apostrophes inside a name (e.g., O’Neill) are not ignored when collating with ICU and en-US, though they should be:

pandoc -s -F pandoc-citeproc -t plain << EOT

Foo [@item1; @item2; @item3].

---
lang: en-US
references:
- id: item1
  type: book
  author:
  - family: Onassis  
    given: Aristotle
  issued:
  - year: 2013
  title: Title
- id: item2
  type: book
  author:
  - family: O'Neill
    given: Eugene
  issued:
  - year: 2013
  title: Title
- id: item3
  type: book
  author:
  - family: Ongaro
    given: Francesco
    dropping-particle: dall'
  issued:
  - year: 2013
  title: Title
...
EOT

Output (some newlines removed):

Foo (Onassis 2013; O’Neill 2013; Ongaro 2013).

O’Neill, Eugene. 2013. _Title_.
Onassis, Aristotle. 2013. _Title_.
Ongaro, Francesco dall’. 2013. _Title_.

Expected, according to the example in CMS 16e, 16.74:

Onassis, Aristotle. 2013. _Title_.
O’Neill, Eugene. 2013. _Title_.
Ongaro, Francesco dall’. 2013. _Title_.

Zotero/citeproc-js/LO get this one right, BTW.

jgm · 2018-02-06T18:42:46Z

Perhaps the best place for a note about the proper stack invocation for installing with ICU support would be the README for this repository. Note that the library locations may vary depending on the system.

We use the rfc5051 package for sorting if ICU isn't enabled. This gets some default sorting right for English and western European languages, but not all.

The apostrophe issue should not be too hard to deal with, I think.

As for ICU collation customization, one would first have to investigate how this is exposed in the haskell icu package.

jgm · 2018-02-06T18:48:15Z

OK, with b2bd8ec I get

Foo (Onassis 2013; O’Neill 2013; Ongaro 2013).

Onassis, Aristotle. 2013. _Title_.

Ongaro, Francesco dall’. 2013. _Title_.

O’Neill, Eugene. 2013. _Title_.

Should we also be lowercasing everything before comparing?
(PS. This is without ICU -- I don't have ICU on my system.)

njbart · 2018-02-06T22:22:51Z

But your non-ICU output still does not conform with CMS: the order should be Onassis – O’Neill – Ongaro. (BTW, APA has the same rule: “Disregard the apostrophe”, APA Manual 6e, 6.25.)

> Should we also be lowercasing everything before comparing?

I would guess so – after all, CMS (16e, 16.71) lists the following, which clearly implies case-independent sorting:

Beauvoir, Simone de
Ben-Gurion, David
Costa, Uriel da
da Cunha, Euclides
D'Amato, Alfonse
de Gaulle, Charles
di Leonardo, Micaela

Since you seem to have decided to have pandoc-citeproc deal with the apostrophe issue itself rather than relying on a library, would it be too difficult to simply strip out ayn and hamza, too, before sorting? (CMS 16e, 11.96 ff. mentions ayn and hamza, but does not say anything explicit on whether they should be disregarded when sorting. The OP seems to think so, and I tend to agree.)

Making this user-configurable would be even better, but so far I haven’t found any hint either how to access the customization mechanism in the haskell icu package.

jgm · 2018-02-07T00:07:07Z

+++ Nick Bart [Feb 06 18 22:22 ]:

> But your non-ICU output still does not conform with CMS: the order should be Onassis – O’Neill – Ongaro. (BTW, APA has the same rule: “Disregard the apostrophe”, APA Manual 6e, 6.25.)

Yes, I know. This is due to the case sensitivity issue.

njbart · 2018-02-07T08:02:47Z

> Yes, I know. This is due to the case sensitivity issue.

I see. So that’s another indication that CMS requires case-insensitive sorting.

APA requires this as well; in addition, it seems that for APA spaces and hyphens should be disregarded, too. (As they put it: “Alphabetize letter by letter. When alphabetizing surnames, remember that “nothing precedes something”: Brown, J. R., precedes Browning, A. R., even though i precedes j in the alphabet.” APA Manual, 6e, 6.25.) I couldn’t find anything in CMS so far that would contradict this, so it’d be probably worth implementing this rule, too.

Test case:

pandoc -s -F pandoc-citeproc -t plain << EOT

Expected (according to APA Manual, 6e, 6.25):  
Benjamin, A. S.  
ben Yaakov, D.  
Brown, J. R.  
Browning, A. R.  
Girard, J.-B.  
Girard-Perregaux, A. S.   
Ibn Abdulaziz, T.  
Ibn Nidal, A. K. M.   
López, M. E.  
López de Molina, G.   
Singh, Y.  
Singh Siddhu, N.  
Villafuerte, S. A.  
Villa-Lobos, J. 

Foo [@item1; @item2; @item3; @item4; @item5; @item6; @item7; @item8; @item9; @item10; @item11; @item12; @item13; @item14].

---
csl: apa.csl
references:
- id: item1
  author:
  - family: Benjamin, A. S.
- id: item2
  author:
  - family: ben Yaakov, D.
- id: item3
  author:
  - family: Brown, J. R.
- id: item4
  author:
  - family: Browning, A. R.
- id: item5
  author:
  - family: Girard, J.-B.
- id: item6
  author:
  - family: Girard-Perregaux, A. S. 
- id: item7
  author:
  - family: Ibn Abdulaziz, T.
- id: item8
  author:
  - family: Ibn Nidal, A. K. M. 
- id: item9
  author:
  - family: López, M. E.
- id: item10
  author:
  - family: López de Molina, G. 
- id: item11
  author:
  - family: Singh, Y.
- id: item12
  author:
  - family: Singh Siddhu, N.
- id: item13
  author:
  - family: Villafuerte, S. A.
- id: item14
  author:
  - family: Villa-Lobos, J. 
...
EOT

Actual output, with ICU, only the reference list shown:

ben Yaakov, D. (n.d.).
Benjamin, A. S. (n.d.).
Brown, J. R. (n.d.).
Browning, A. R. (n.d.).
Girard-Perregaux, A. S. (n.d.).
Girard, J.-B. (n.d.).
Ibn Abdulaziz, T. (n.d.).
Ibn Nidal, A. K. M. (n.d.).
López de Molina, G. (n.d.).
López, M. E. (n.d.).
Singh Siddhu, N. (n.d.).
Singh, Y. (n.d.).
Villa-Lobos, J. (n.d.).
Villafuerte, S. A. (n.d.).

This shows that the sorting of all pairs except “Brown/Browning” and “Ibn …/Ibn …” is currently wrong according to APA. The actual sort order is the same with chicago-author-date.csl, BTW.

njbart · 2018-02-07T11:28:58Z

Ok, once again, this time names properly split into family/given/particles:

pandoc -s -F pandoc-citeproc -t plain << EOT

Expected (according to APA Manual, 6e, 6.25):  
Benjamin, A. S.  
ben Yaakov, D.  
Brown, J. R.  
Browning, A. R.  
Girard, J.-B.  
Girard-Perregaux, A. S.  
Ibn Abdulaziz, T.  
Ibn Nidal, A. K. M.  
López, M. E.  
López de Molina, G.  
Singh, Y.  
Singh Siddhu, N.  
Villafuerte, S. A.  
Villa-Lobos, J.

Foo [@item1; @item2; @item3; @item4; @item5; @item6; @item7; @item8; @item9; @item10; @item11; @item12; @item13; @item14].

---
csl: apa.csl
references:
- id: item1
  author:
  - family: Benjamin
    given: A. S.
- id: item2
  author:
  - family: Yaakov
    non-dropping-particle: ben
    given: D.
- id: item3
  author:
  - family: Brown
    given: J. R.
- id: item4
  author:
  - family: Browning
    given: A. R.
- id: item5
  author:
  - family: Girard
    given: J.-B.
- id: item6
  author:
  - family: Girard-Perregaux
    given: A. S.
- id: item7
  author:
  - family: Ibn Abdulaziz
    given: T.
- id: item8
  author:
  - family: Ibn Nidal
    given: A. K. M.
- id: item9
  author:
  - family: López
    given: M. E.
- id: item10
  author:
  - family: López de Molina
    given: G.
- id: item11
  author:
  - family: Singh
    given: Y.
- id: item12
  author:
  - family: Singh Siddhu
    given: N.
- id: item13
  author:
  - family: Villafuerte
    given: S. A.
- id: item14
  author:
  - family: Villa-Lobos
    given: J.
...
EOT

Output, with ICU:

ben Yaakov, D. (n.d.).
Benjamin, A. S. (n.d.).
Brown, J. R. (n.d.).
Browning, A. R. (n.d.).
Girard, J.-B. (n.d.).
Girard-Perregaux, A. S. (n.d.).
Ibn Abdulaziz, T. (n.d.).
Ibn Nidal, A. K. M. (n.d.).
López de Molina, G. (n.d.).
López, M. E. (n.d.).
Singh Siddhu, N. (n.d.).
Singh, Y. (n.d.).
Villa-Lobos, J. (n.d.).
Villafuerte, S. A. (n.d.).

Interestingly, using the proper family/given structure reverses the sort order of “Girard …”, so this looks ok now, too.

njbart · 2018-02-07T12:16:08Z

Two more test cases, from CMS 16e, 16.73 and .75 (with ICU, the “Mac”s look ok, the “Saint”s don’t):

pandoc -s -F pandoc-citeproc -t plain << EOT

From CMS 16e 16.73. Expected:

Macalister, Donald  
MacAlister, Paul  
Macauley, Catharine  
Macmillan, Harold  
Madison, James  
McAllister, Ward  
McAuley, Catherine  
McMillan, Edwin M.

Foo [@item1; @item2; @item3; @item4; @item5; @item6; @item7; @item8].

---
references:
- id: item1
  author:
  - family: Macalister
    given: Donald
- id: item2
  author:
  - family: MacAlister
    given: Paul
- id: item3
  author:
  - family: Macauley
    given: Catharine
- id: item4
  author:
  - family: Macmillan
    given: Harold
- id: item5
  author:
  - family: Madison
    given: James
- id: item6
  author:
  - family: McAllister
    given: Ward
- id: item7
  author:
  - family: McAuley
    given: Catherine
- id: item8
  author:
  - family: McMillan
    given: Edwin M.
...
EOT

Current output, with ICU:

Macalister, Donald. n.d.
MacAlister, Paul. n.d.
Macauley, Catharine. n.d.
Macmillan, Harold. n.d.
Madison, James. n.d.
McAllister, Ward. n.d.
McAuley, Catherine. n.d.
McMillan, Edwin M. n.d.

pandoc -s -F pandoc-citeproc -t plain << EOT

From CMS 16e 16.75. Expected:

Sainte-Beuve, Charles-Augustin  
Saint-Gaudens, Augustus  
Saint-Saëns, Camille  
San Martin, José de  
St. Denis, Ruth  
St. Laurent, Louis Stephen

Foo [@item1; @item2; @item3; @item4; @item5; @item6].

---
references:
- id: item1
  author:
  - family: Sainte-Beuve
    given: Charles-Augustin
- id: item2
  author:
  - family: Saint-Gaudens
    given: Augustus
- id: item3
  author:
  - family: Saint-Saëns
    given: Camille
- id: item4
  author:
  - family: San Martin
    given: José
    dropping-particle: de
- id: item5
  author:
  - family: St. Denis
    given: Ruth
- id: item6
  author:
  - family: St. Laurent
    given: Louis Stephen
...
EOT

Current output, with ICU:

Saint-Gaudens, Augustus. n.d.
Saint-Saëns, Camille. n.d.
Sainte-Beuve, Charles-Augustin. n.d.
San Martin, José de. n.d.
St. Denis, Ruth. n.d.
St. Laurent, Louis Stephen. n.d.

jgm · 2018-02-07T20:44:03Z

Can you build with commit 4406196 and try your tests again? This makes sorting case-insensitive. It would be good to have a list of problems that remain.

njbart · 2018-02-07T22:45:39Z

Done.

Onassis – O’Neill – Ongaro seems ok now.
the “Mac”s seem ok, too, as they did before

Still problematic:

Saint-Gaudens and Saint-Saëns before Sainte-Beuve (should be Sainte-Beuve – Saint-Gaudens – Saint-Saëns)
most APA examples, as reported above (all, it seems, having to do with spaces and hyphens)

njbart · 2018-02-08T07:59:47Z

Wait, now I’m seeing the wrong “O’Neill – Onassis – Ongaro” again (in the same terminal session in which I earlier got “Onassis – O’Neill – Ongaro”, still visible when scrolling up), without any apparent changes to pandoc-citeproc or the test script. Puzzling. Will try to figure out what happened here.

njbart · 2018-02-08T13:18:27Z

Ok, went back to doing git pull and compiling pandoc-citeproc from within the pandoc-citeproc dir. Not sure whether there’s any connection, but at least now I consistently get the correct “Onassis – O’Neill – Ongaro”. The other issues listed under “Still problematic” persist.

njbart · 2018-02-08T17:58:30Z

The MLA Handbook, 8e, 2.7.1, is pretty specific on “punctuation marks and spaces” as well as “accents and other diacritical marks”:

2.7.1 LETTER-BY-LETTER ALPHABETIZATION

The alphabetical ordering of entries that begin with authors' names is determined by the letters that come before the commas separating the authors' last and first names. Other punctuation marks and spaces are ignored. The letters following the commas are considered only when two or more last names are identical.

Descartes, René
De Sica, Vittorio
MacDonald, George
McCullers, Carson
Morris, Robert
Morris, William
Morrison, Toni
Saint-Exupéry, Antoine de
St. Denis, Ruth

Accents and other diacritical marks should be ignored in alphabetization: for example, é is treated the same as e. Special characters, such as @ in an online username, are also ignored.

Though APA and CMS are silent on some of these, they don’t seem to contradict anything in this passage from MLA. In any case, both APA and CMS call for removing apostrophes, spaces, hyphens, and abbreviation dots.

In addition, specifically for Arabic, there’s this piece of advice:

Diacritics, and especially the letter ‘ayn, can occur at the start of a word or name. It is correct style for sorting purposes to disregard any diacritics based on Arabic transliterations.
(Hedden, Heather. 2007. “Arabic Names”. The Indexer 25 (3): 9C–15C. http://www.ingentaconnect.com/content/index/tiji/2007/00000025/00000003/art00025.)

All in all, I think a case could be made for disregarding all spaces, punctuation marks, and (standalone) diacritics for sorting purposes. (Accented characters, however, still have to be sorted in a locale-dependent fashion.) What I’m not sure about is whether ICU provides any mechanisms for disregarding spaces, punctuation marks, and (standalone) diacritics.
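
To make the “letter by letter” rule concrete, here is a rough shell sketch. It is not how pandoc-citeproc implements sorting, and the helper names sort_key and letter_by_letter_sort are invented for illustration. It lowercases ASCII letters, drops spaces, periods, straight apostrophes and hyphens, and keeps the comma as the family/given boundary so that “nothing precedes something”. It does not fold accents or handle locale-specific letters.

```shell
#!/bin/sh
# Hypothetical helper: build a crude sort key for "Family, Given" names.
# ASCII letters are lowercased; spaces, periods, apostrophes and hyphens
# are removed. The comma is kept: it sorts before every letter in the
# C locale, so "brown,jr" precedes "browning,ar".
sort_key() {
  printf '%s\n' "$1" | LC_ALL=C tr 'A-Z' 'a-z' | LC_ALL=C tr -d " .'-"
}

# Reads names on stdin (one per line) and prints them ordered by sort_key.
letter_by_letter_sort() {
  tab=$(printf '\t')
  while IFS= read -r name; do
    printf '%s\t%s\n' "$(sort_key "$name")" "$name"
  done | LC_ALL=C sort -t "$tab" -k1,1 | cut -f2-
}

printf '%s\n' 'Browning, A. R.' 'Villa-Lobos, J.' 'Brown, J. R.' 'Villafuerte, S. A.' |
  letter_by_letter_sort
```

With these four names the sketch prints Brown, Browning, Villafuerte, Villa-Lobos, which matches the APA order quoted earlier in the thread.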

jgm · 2018-02-08T19:37:17Z

Oh, I just realized that the way this is set up, the current locale (the locale in which the program is run, which may be different from the locale specified by the style) is always used for ICU collation settings. That's not great, but won't be easy to change.

jgm · 2018-02-08T20:33:01Z

OK, I've got a fix that works with all of your examples.
I haven't tried the Arabic cases, because I don't know how to set them up, but feel free to test further!

jgm · 2018-02-08T20:33:13Z

By the way, these tests all work without icu.

njbart · 2018-02-09T09:25:56Z

Ok, my (five) tests based on APA, CMS (O’/Mac/St.), and MLA go green, both with and without icu.

What’s not ideal, though, is that the non-icu version still fails to correctly collate accented (and similar) characters: Both en-US and da-DK output “Author – Zuthor – Øøthor”, which happens to be correct in Danish, but is incorrect in English, according to both CMS and MLA (APA is silent on this).

I know that including icu means a noticeable overhead, but if rfc5051 neither allows locale-sensitive collation nor gets the collation of accented characters in English right, icu might turn out to be the preferable default.

njbart · 2018-02-09T10:14:43Z

Test case for transliterated Arabic:

pandoc -s -F pandoc-citeproc -t plain << EOT

Expected: ʾUdhrī (w/ hamza) and ʿUdhrī (w/ ayn) between Uch and Uebel
(i.e., ayn/hamza disregarded when collating)

Foo [@item1; @item2; @item3; @item4; @item5; @item6; @item7; @item8].

---
references:
- id: item1
  author:
  - family: ʾUdhrī
    non-dropping-particle: al-
    given: Jamīl
    note: hamza
- id: item2
  author:
  - family: ʿUdhrī
    non-dropping-particle: al-
    given: Jamīl
    note: ayn
- id: item3
  author:
  - family: \'Udhrī
    non-dropping-particle: al-
    given: Jamīl
    note: straight apostrophe
- id: item4
  author:
  - family: ‘Udhrī
    non-dropping-particle: al-
    given: Jamīl
    note: inverted apostrophe = opening single curly quote (for ayn)
- id: item5
  author:
  - family: ’Udhrī
    non-dropping-particle: al-
    given: Jamīl
    note: apostrophe = closing single curly quote (for hamza)
- id: item6
  author:
  - family: Uch
    given: Ann
- id: item7
  author:
  - family: Uebel
    given: Joe
- id: item8
  author:
  - family: Zzz
    given: Zoe
...
EOT

Current output with and without icu (references only):

Uch, Ann. n.d.
'Udhrī, Jamīl al-. n.d.
‘Udhrī, Jamīl al-. n.d.
’Udhrī, Jamīl al-. n.d.
Uebel, Joe. n.d.
Zzz, Zoe. n.d.
ʾUdhrī, Jamīl al-. n.d.
ʿUdhrī, Jamīl al-. n.d.
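
As an illustration of what “disregarded when collating” would mean here, the following shell sketch strips ayn (U+02BF), hamza (U+02BE) and straight or curly apostrophes from a copy of each name and sorts on that copy. The helper name sort_ignoring_marks is invented, and this is not the pandoc-citeproc implementation. The comparison is a plain byte-wise sort in the C locale, so it is case-sensitive and does not fold accents.

```shell
#!/bin/sh
# Hypothetical helper: sort names while ignoring ayn, hamza and apostrophes.
# Each input line gets a key with those characters removed; lines are
# sorted by key (C locale, byte order) and the key is then cut off again.
sort_ignoring_marks() {
  while IFS= read -r name; do
    key=$(printf '%s\n' "$name" |
      LC_ALL=C sed -e "s/ʿ//g" -e "s/ʾ//g" -e "s/'//g" -e "s/‘//g" -e "s/’//g")
    printf '%s\t%s\n' "$key" "$name"
  done | LC_ALL=C sort | cut -f2-
}

printf '%s\n' 'Uebel' 'ʿUdhrī' 'Zzz' 'Uch' | sort_ignoring_marks
```

This prints Uch, ʿUdhrī, Uebel, Zzz, i.e. the name with the ayn lands between Uch and Uebel instead of after Zzz.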

jgm · 2018-02-09T19:59:57Z

+++ Nick Bart [Feb 09 18 09:25 ]:

> I know that including icu means a noticeable overhead, but if rfc5051 neither allows locale-sensitive collation nor gets the collation of accented characters in English right, icu might turn out to be the preferable default.

The problem is that icu requires a C library, which can be difficult to install on some platforms, so I'd prefer the default install not to require it.

jgm · 2018-02-16T23:40:15Z

I'm reopening until we fix the transliterated arabic case.

kchalipa · 2018-03-06T18:03:07Z

Hi all,

I really appreciate the time and effort you put into resolving this issue! I'm sorry that I took so long to say so---I followed the conversation with great interest---but I was dealing with an xcode glitch on my machine that prevented me from trying out the new implementation for a while, so I didn't have anything to report. I've now installed the latest version of pandoc (2.1.2) and pandoc-citeproc (0.14.1.5) and tried running a sample bibliography; the apostrophes and half ring characters no longer interfere with the sorting, so it looks like everything works great! Thank you!

I just have one follow-up question about installing with ICU, which, as I understand from @njbart, is necessary for the correct collation of accented characters in English. I completely uninstalled pandoc and pandoc-citeproc from my system and followed the steps @jgm gave above (I went to /usr/local/bin and ran stack setup, stack install pandoc pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"), but my system can't seem to find it: which pandoc reveals it to be in ~/.local/bin, but running pandoc tells me /usr/local/bin/pandoc: No such file or directory. It seems like I need to point it to the new location, or have stack install in the /usr/local/bin directory. Can you advise how I might do this, or suggest a preferred way to install with ICU? I'm sure it's simple, I just have no background in this stuff.

jgm · 2018-03-06T19:13:34Z

Make sure ~/.local/bin is in your path, before /usr/local/bin. Add to .bashrc:

export PATH=$HOME/.local/bin:$PATH
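
If .bashrc gets sourced more than once, the same idea can be written so that the directory is not prepended repeatedly. A minimal sketch; the function name prepend_path is hypothetical and not part of any tool mentioned here.

```shell
#!/bin/sh
# Hypothetical helper: put a directory at the front of PATH unless it
# is already the first entry, so re-sourcing the file is harmless.
prepend_path() {
  case "$PATH" in
    "$1"|"$1":*) ;;          # already first: leave PATH alone
    *) PATH="$1:$PATH" ;;
  esac
}

prepend_path "$HOME/.local/bin"
export PATH
```

After opening a new shell, which pandoc should then report the copy in ~/.local/bin.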

kchalipa · 2018-03-06T20:13:15Z

That did it! Thanks so much, John! Pandoc is an amazing piece of software—it's really changed my life as an academic writer—and I appreciate you taking the time to maintain and improve it.

palinurus · 2018-05-12T01:36:19Z

What would be the best way to address this issue as a Windows user who installs pandoc from the .msi? Is compiling pandoc-citeproc with unicode-collation really the only way to get it to correctly sort Péb-- before Pet-- instead of after it? I am happy to engage in ad-hoc tweaks to my bibliography file, as one must from time to time...

(I've tried the various sortkey and noopsort type flags people suggest for bibtex, but none of them have, at least in the past, accomplished anything.)

jgm · 2018-05-12T02:04:49Z

We could look into changing our pandoc appveyor setup (which we use for the Windows binary) so it builds pandoc-citeproc with unicode-collation. I'm no Windows expert, but if someone wants to propose the needed tweaks to appveyor.yaml, I'll consider it.

palinurus · 2018-05-12T22:19:29Z

Thanks for the reply, John! I certainly don't know how to do that, but, FWIW to anyone with this issue in the future, it's not hard to get stack set up on Windows and install the icu libraries required to build pandoc-citeproc as described above. And now my bibliography is correctly sorted!

jgm closed this as completed in aa4265c Feb 8, 2018

jgm reopened this Feb 16, 2018

jgm closed this as completed in 2b82e4b Feb 16, 2018

njbart mentioned this issue Feb 13, 2025

Latex writer: ?‘ produces ?' rather than ?` jgm/pandoc#10610

Closed
