Sorting bibliographies with special characters #320
Comments
I'm not sure what to do about this. There are different ways to sort, as you note, and it's hard to see how to incorporate an option for this. You could try compiling pandoc-citeproc with the |
Thanks! I'll look into it and report back about what I find. I see what you mean about there being no easy option; looking through the many fields available in the CSL editor, I didn't see anything to match the |
Using ICU is almost certainly the way to go here. This used to work well a few years back, though I stopped including it when compiling pandoc-citeproc when I switched from cabal to stack. ICU can do language (or rather, locale) sensitive collation. (For example, if set to As to “extra characters”, these don’t seem to be displayed properly in the post above, so I can’t say anything on these. Collating “al-ʿUdhrī” under “U” merely requires setting And yes, there is nothing like the biblatex @jgm – How exactly would I compile pandoc-citeproc with the |
Thanks for the further comments! Just to clarify my extra characters, I posted two (small) glyphs that are used in Arabic transliteration, ayn (ʿ) and hamza (ʾ) --- U+02BF and U+02BE respectively. Since they fall outside the normal set of Latin characters, they should typically be ignored in sorting/parsing; some programs do this, but the vast majority (including citeproc, from what I've seen) sort them at the end of the alphabet. Because of the ayn, the example we're discussing, al-ʿUdhrī, would get sorted after Z, which may just be how the cookie crumbles, unless we use the unused-variable trick or compiling with ICU works out. I'll note that when I used biblatex for my bibliographies, it did handle these characters very well (by ignoring them), so maybe we can be optimistic! |
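For readers following along, here is a minimal Python sketch of why these two glyphs land after Z under a plain code-point comparison, and what ignoring them would look like. It is not pandoc-citeproc's code. The helper name `ignore_half_rings` and the sample names "Zayd" and "Tabari" are made up for illustration, and the particle "al-" is assumed to have been split off already.

```python
# Ayn (U+02BF) and hamza (U+02BE) have code points far above "z" (U+007A),
# so a plain code-point sort places names starting with them after "Z...".
AYN, HAMZA = "\u02bf", "\u02be"


def ignore_half_rings(name: str) -> str:
    """Hypothetical sort key: drop ayn and hamza, then compare case-insensitively."""
    return name.replace(AYN, "").replace(HAMZA, "").casefold()


# Illustrative family names only; "al-" is assumed to be stored as a particle.
names = ["Zayd", "\u02bfUdhr\u012b", "Tabari"]

naive = sorted(names)                            # code-point order
ignoring = sorted(names, key=ignore_half_rings)  # ayn/hamza disregarded

print(naive)
print(ignoring)
```

With the key applied, ʿUdhrī sorts between the T and Z entries instead of after them.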
+++ Nick Bart [Feb 03 18 08:27 ]:
@jgm – How exactly would I compile pandoc-citeproc with the
unicode_collation flag, using stack?
stack install pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"
|
I’m a little confused now – I thought we had this issue sorted out in #122. I am currently installing the dev version using the following script (based on the instructions from the pandoc wiki; I’ve been using this script for a while, it’s only pandoc-citeproc that gets a special treatment now – comments, e.g., on whether this is the optimal sequence, are welcome):
Now, the following script, which tries both the latest dev version (first in my path) and the homebrew version (at /usr/local/bin), for the two locales en-US and da-DK …
outputs this (some newlines removed):
… so no difference at all. For da-DK, this collation seems correct (Ø after Z), but for en-US, it is not (CMoS 16e, 16.67, “Alphabetizing accented letters” quite clearly calls for disregarding any accents, so we should have A – Ø – Z here).
Just in case, I’m using macOS 10.13.3, and this is the output I get from
Any ideas? |
I'm sorry, the wiki needed updating. (See current version.)
What you really want to do is just
1) clone pandoc repository, cd to it
2) stack setup
3) stack install pandoc pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"
|
Ok, this seems to have worked. (Should we add the complete command I had to use, Now I see a new issue: It seems correct collation of accented characters in English is only available if ICU is used, so the question is, could it be included by default? EDIT: Removed a report on sorting of “Aa” – it seems in Danish, “Aa” should be sorted just like “Å”. |
As to special characters such as ayn and hamza, it seems these are currently sorted after “Z” in en-US and a few other locales I tried, even when using ICU. (Other special characters, e.g., a straight apostrophe at the start of a name, do not influence sorting, so ignoring specific characters certainly does seem to be possible.) ICU does seem to provide a mechanism for customizing collation, so there might be a possibility to make this user-configurable, e.g., via a pandoc-citeproc flag. |
More data points: Apostrophes inside a name (e.g., O’Neill) are not ignored when collating with ICU and
Output (some newlines removed):
Expected, according to the example in CMS 16e, 16.74:
Zotero/citeproc-js/LO get this one right, BTW. |
Perhaps the best place for a note about the proper stack invocation for installing with ICU support would be the README for this repository. Note that the library locations may vary depending on the system. We use the The apostrophe issue should not be too hard to deal with, I think. As for ICU collation customization, one would first have to investigate how this is exposed in the haskell icu package. |
OK, with b2bd8ec I get
Should we also be lowercasing everything before comparing? |
But your non-ICU output still does not conform with CMS: the order should be Onassis – O’Neill – Ongaro. (BTW, APA has the same rule: “Disregard the apostrophe”, APA Manual 6e, 6.25.)
I would guess so – after all, CMS (16e, 16.71) lists the following, which clearly implies case-independent sorting:
Since you seem to have decided to have pandoc-citeproc deal with the apostrophe issue itself rather than relying on a library, would it be too difficult to simply strip out ayn and hamza, too, before sorting? (CMS 16e, 11.96 ff. mentions ayn and hamza, but does not say anything explicit on whether they should be disregarded when sorting. The OP seems to think so, and I tend to agree.) Making this user-configurable would be even better, but so far I haven’t found any hint either how to access the customization mechanism in the haskell icu package. |
+++ Nick Bart [Feb 06 18 22:22 ]:
But your non-ICU output still does not conform with CMS: the order
should be Onassis – O’Neill – Ongaro. (BTW, APA has the same rule:
“Disregard the apostrophe”, APA Manual 6e, 6.25.)
Yes, I know. This is due to the case sensitivity issue.
|
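A small Python sketch of how case sensitivity alone could produce the wrong order even after the apostrophe is dropped. This is a guess at the mechanism based on the comment above, not the actual pandoc-citeproc code, and `strip_apostrophes` is a hypothetical helper.

```python
def strip_apostrophes(name: str) -> str:
    """Hypothetical helper: drop straight and typographic apostrophes."""
    return name.replace("'", "").replace("\u2019", "")


names = ["Ongaro", "O\u2019Neill", "Onassis"]

# Case-sensitive: "ONeill" has an uppercase "N" (U+004E), which sorts before
# the lowercase "n" (U+006E) in "Onassis" and "Ongaro".
case_sensitive = sorted(names, key=strip_apostrophes)

# Case-insensitive: fold case after stripping, giving the order CMS and APA ask for.
case_insensitive = sorted(names, key=lambda n: strip_apostrophes(n).casefold())

print(case_sensitive)
print(case_insensitive)
```

The first result is O’Neill, Onassis, Ongaro; the second is Onassis, O’Neill, Ongaro.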
I see. So that’s another indication that CMS requires case-insensitive sorting. APA requires this as well; in addition, it seems that for APA spaces and hyphens should be disregarded, too. (As they put it: “Alphabetize letter by letter. When alphabetizing surnames, remember that “nothing precedes something”: Brown, J. R., precedes Browning, A. R., even though i precedes j in the alphabet.” APA Manual, 6e, 6.25.) I couldn’t find anything in CMS so far that would contradict this, so it’d probably be worth implementing this rule, too. Test case:
Actual output, with ICU, only the reference list shown:
This shows that the sorting of all pairs except “Brown/Browning” and “Ibn …/Ibn …” is currently wrong according to APA. Same actual sort order in the output with |
Ok, once again, this time names properly split into family/given/particles:
Output, with ICU:
Interestingly, using the proper family/given structure reverses the sort order of “Girard …”, so this looks ok now, too. |
Two more test cases, from CMS 16e, 16.73 and .75 (with ICU, the “Mac”s look ok, the “Saint”s don’t):
Current output, with ICU:
Current output, with ICU:
|
Can you build with commit 4406196
and try your tests again? This makes sorting
case-insensitive. It would be good to have a list of
problems that remain.
|
Done.
Still problematic:
|
Wait, now I’m seeing the wrong “O’Neill – Onassis – Ongaro” again (in the same terminal session in which I earlier got “Onassis – O’Neill – Ongaro”, still visible when scrolling up), without any apparent changes to pandoc-citeproc or the test script. Puzzling. Will try to figure out what happened here. |
Ok, went back to doing |
The MLA Handbook, 8e, 2.7.1, is pretty specific on “punctuation marks and spaces” as well as “accents and other diacritical marks”:
Though APA and CMS are silent on some of these, they don’t seem to contradict anything in this passage from MLA. In any case, both APA and CMS call for removing apostrophes, spaces, hyphens, and abbreviation dots. In addition, specifically for Arabic, there’s this piece of advice:
All in all, I think a case could be made for disregarding all spaces, punctuation marks, and (standalone) diacritics for sorting purposes. (Accented characters, however, still have to be sorted in a locale-dependent fashion.) What I’m not sure about is whether ICU provides any mechanisms for disregarding spaces, punctuation marks, and (standalone) diacritics. |
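To make the proposal concrete, here is a rough Python sketch of a sort key that disregards spaces, punctuation marks, and standalone diacritics by Unicode general category. It is only an illustration of the idea, not how pandoc-citeproc or ICU does it. The function name `collation_key` and the sample strings are assumptions, and accented letters are deliberately left untouched because their handling is locale-dependent.

```python
import unicodedata


def collation_key(text: str) -> str:
    """Hypothetical key: drop spaces, punctuation, and modifier characters."""
    kept = []
    for ch in text:
        cat = unicodedata.category(ch)
        # P* = punctuation (apostrophes, hyphens, dots), Z* = separators (spaces),
        # Lm / Sk = modifier letters and symbols (ayn and hamza are Lm).
        if cat[0] in "PZ" or cat in ("Lm", "Sk"):
            continue
        kept.append(ch)
    # Accented letters such as "ī" are kept as-is; case is folded.
    return "".join(kept).casefold()


print(collation_key("O\u2019Neill"))
print(collation_key("al-\u02bfUdhr\u012b"))
print(collation_key("Brown-Smith, J. R."))
```

A key like this would still need a locale-aware comparison on top of it for the accented characters.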
Oh, I just realized that the way this is set up, the current locale (the locale in which the program is run, which may be different from the locale specified by the style) is always used for ICU collation settings. That's not great, but won't be easy to change. |
OK, I've got a fix that works with all of your examples. |
By the way, these tests all work without icu. |
Ok, my (five) tests based on APA, CMS (O’/Mac/St.), and MLA go green, both with and without icu. What’s not ideal, though, is that the non-icu version still fails to correctly collate accented (and similar) characters: Both I know that including icu means a noticeable overhead, but if rfc5051 neither allows locale-sensitive collation nor gets the collation of accented characters in English right, icu might turn out to be the preferable default. |
Test case for transliterated Arabic:
Current output with and without icu (references only):
|
+++ Nick Bart [Feb 09 18 09:25 ]:
I know that including icu means a noticeable overhead, but if rfc5051
neither allows locale-sensitive collation nor gets the collation of
accented characters in English right, icu might turn out to be the
preferable default.
The problem is that icu requires a C library, which can be
difficult to install on some platforms, so I'd prefer
the default install not to require it.
|
I'm reopening until we fix the transliterated Arabic case. |
Hi all, I really appreciate the time and effort you put into resolving this issue! I'm sorry that I took so long to say so---I followed the conversation with great interest---but I was dealing with an xcode glitch on my machine that prevented me from trying out the new implementation for a while, so I didn't have anything to report. I've now installed the latest version of pandoc (2.1.2) and pandoc-citeproc (0.14.1.5) and tried running a sample bibliography; the apostrophes and half ring characters no longer interfere with the sorting, so it looks like everything works great! Thank you! I just have one follow-up question about installing with ICU, which, as I understand from @njbart, is necessary for the correct collation of accented characters in English. I completely uninstalled pandoc and pandoc-citeproc from my system and followed the steps @jgm gave above (I went to |
Make sure ~/.local/bin is in your path, before
/usr/local/bin.
Add to .bashrc:
export PATH=$HOME/.local/bin:$PATH
+++ kchalipa [Mar 06 18 18:03 ]:
… Hi all,
I really appreciate the time and effort you put into resolving this
issue! I'm sorry that I took so long to say so---I followed the
conversation with great interest---but I was dealing with an xcode
glitch on my machine that prevented me from trying out the new
implementation for a while, so I didn't have anything to report. I've
now installed the latest version of pandoc (2.1.2) and pandoc-citeproc
(0.14.1.5) and tried running a sample bibliography; the apostrophes and
half ring characters no longer interfere with the sorting, so it looks
like everything works great! Thank you!
I just have one follow-up question about installing with ICU, which, as
I understand from @njbart, is necessary for the correct collation of
accented characters in English. I completely uninstalled pandoc and
pandoc-citeproc from my system and followed the steps @jgm gave
above (I went to /usr/local/bin and ran stack setup, stack install
pandoc pandoc-citeproc --flag "pandoc-citeproc:unicode_collation"), but
my system can't seem to find it: which pandoc reveals it to be in
~/.local/bin, but running pandoc tells me /usr/local/bin/pandoc: No
such file or directory. It seems like I need to point it to the new
location, or have stack install in the /usr/local/bin directory. Can
you advise how I might do this, or suggest a preferred way to install
with ICU? I'm sure it's simple, I just have no background in this
stuff.
|
That did it! Thanks so much, John! Pandoc is an amazing piece of software—it's really changed my life as an academic writer—and I appreciate you taking the time to maintain and improve it. |
What would be the best way to address this issue as a Windows user who installs pandoc from the .msi? Is compiling pandoc-citeproc with unicode-collation really the only way to get it to correctly sort Péb-- before Pet-- instead of after it? I am happy to engage in ad-hoc tweaks to my bibliography file, as one must from time to time... (I've tried the various sortkey and noopsort type flags people suggest for bibtex, but none of them have, at least in the past, accomplished anything.) |
We could look into changing our pandoc appveyor setup (which we use
for the Windows binary) so it builds pandoc-citeproc with
unicode-collation. I'm no Windows expert, but if someone wants
to propose the needed tweaks to appveyor.yaml, I'll consider it.
|
Thanks for the reply, John! I certainly don't know how to do that, but, FWIW to anyone with this issue in the future, it's not hard to get stack set up on Windows and install the icu libraries required to build pandoc-citeproc as described above. And now my bibliography is correctly sorted! |
Hi all,
I've run into a challenge with pandoc-citeproc regarding the way it sorts special characters. It seems that any special character (even common glyphs like à) or diacritic is sorted after the letter Z, which can cause major sorting problems in large bibliographies. I posted an example of this on StackExchange; basically what you'll see is that a name like Ábe will be sorted after a name like Aze. This strikes me as an issue that would affect many writers who rely on special characters or diacritics in their work--is there any way to resolve this?
As a further note, I will point out that different authors may have different sorting criteria that are hard to cover in a one-size-fits-all approach. For example, I might want to order my bibliography "San, Šen, Son, Šun", while another author would want to treat the Š as its own letter (San, Son, Šen, Šun). In addition, there are disciplines that use extra characters like ʿ or ʾ that are typically ignored altogether in alphabetical sorting, but are currently sent to the end of the alphabet. Given how idiosyncratic these scenarios can get, the best solution that I can think of is to include some kind of sort key that lets me tell the parser where to put the entry; so if I have a citation like "al-ʿUdhrī" I could tell it to be sorted not under "a" nor under "z" but under "u", as though it were "Udhri". Such solutions have been devised for BibTeX; is there, or could there be, something similar with pandoc-citeproc?
Sorry if this has already been answered elsewhere and I failed to find it.
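The two orderings described above can be sketched in a few lines of Python. This is only an illustration of the difference, not anything pandoc-citeproc provides. The function names and the `ALPHABET` string (a hypothetical alphabet in which š is a separate letter after s) are assumptions made for the example.

```python
import unicodedata


def accent_insensitive_key(name: str) -> str:
    """Treat accented letters as their base letters (Š sorts like S)."""
    decomposed = unicodedata.normalize("NFD", name)
    base = "".join(ch for ch in decomposed if not unicodedata.combining(ch))
    return base.casefold()


# Hypothetical alphabet in which "š" is its own letter, placed after "s".
ALPHABET = "abcdefghijklmnopqrsštuvwxyz"


def separate_letter_key(name: str) -> list:
    """Rank each letter by its position in ALPHABET.

    Raises ValueError for characters that are not in ALPHABET.
    """
    composed = unicodedata.normalize("NFC", name).casefold()
    return [ALPHABET.index(ch) for ch in composed]


names = ["Šun", "Son", "Šen", "San"]

print(sorted(names, key=accent_insensitive_key))  # San, Šen, Son, Šun
print(sorted(names, key=separate_letter_key))     # San, Son, Šen, Šun
```

The first key also places Ábe before Aze. Note that letters without a canonical decomposition, such as Ø, are not affected by the NFD step and would need separate treatment.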