This repository has been archived by the owner on Apr 30, 2021. It is now read-only.

Sorting in bibliography: lowercase appears after uppercase (“author” after “Zuthor”) #122

Closed
njbart opened this issue Apr 12, 2015 · 17 comments

Contributor

njbart commented Apr 12, 2015

This affects both lowercase family names and non-dropping-particles.

Example:

pandoc -s -F pandoc-citeproc -t markdown-citations-markdown_in_html_blocks << EOT

---
references:
- author:
  - family: author
    given: Al
  id: a
  issued:
    date-parts:
    - - 1985
  title: Title
  type: book

- author:
  - family: Author
    given: Al
  id: A
  issued:
    date-parts:
    - - 1985
  title: Title
  type: book

- author:
  - family: zuthor
    given: Zoe
  id: z
  issued:
    date-parts:
    - - 1985
  title: Title
  type: book

- author:
  - family: Zuthor
    given: Zoe
  id: Z
  issued:
    date-parts:
    - - 1985
  title: Title
  type: book

- author:
  - family: Doe
    given: Dan
    non-dropping-particle: de
  id: d
  issued:
    date-parts:
    - - 1985
  title: Title
  type: book
...

# Text

Foo [@A; @a; @d; @Z; @z].

# References

EOT

Expected:

Text
====

Foo (Author 1985; author 1985; de Doe 1985; Zuthor 1985; zuthor 1985).

References {#references .unnumbered}
==========

Author, Al. 1985. *Title*.

author, Al. 1985. *Title*.

de Doe, Dan. 1985. *Title*.

Zuthor, Zoe. 1985. *Title*.

zuthor, Zoe. 1985. *Title*.

Actual:

Text
====

Foo (Author 1985; author 1985; de Doe 1985; Zuthor 1985; zuthor 1985).

References {#references .unnumbered}
==========

Author, Al. 1985. *Title*.

Zuthor, Zoe. 1985. *Title*.

author, Al. 1985. *Title*.

de Doe, Dan. 1985. *Title*.

zuthor, Zoe. 1985. *Title*.
Owner

jgm commented May 3, 2015

@nickbart1980 - I get the correct output for this test case (so I can't reproduce your "actual" output):

Text
====

Foo (Author 1985; author 1985; de Doe 1985; Zuthor 1985; zuthor 1985).

References {#references .unnumbered}
==========

Author, Al. 1985. *Title*.

author, Al. 1985. *Title*.

de Doe, Dan. 1985. *Title*.

Zuthor, Zoe. 1985. *Title*.

zuthor, Zoe. 1985. *Title*.

Are you perhaps using an older version inadvertently?

Contributor Author

njbart commented May 3, 2015

No, latest dev version, but I'm usually compiling with cabal install -ftest_citeproc -funicode_collation. Without -funicode_collation, i.e., just cabal install -ftest_citeproc, I get the same result as you do.

Owner

jgm commented May 3, 2015

I just installed with -funicode_collation. Still couldn't reproduce what you're seeing...
With unicode_collation:

Text
====

Foo (Author [1985](#ref-A); author [1985](#ref-a); de Doe
[1985](#ref-d); Zuthor [1985](#ref-Z); zuthor [1985](#ref-z)).

References {#references .unnumbered}
==========

author, Al. 1985. *Title*.

Author, Al. 1985. *Title*.

de Doe, Dan. 1985. *Title*.

zuthor, Zoe. 1985. *Title*.

Zuthor, Zoe. 1985. *Title*.

Contributor Author

njbart commented May 4, 2015

That's odd. I’m on MacOS 10.10, and have now upgraded to icu4c 55.1 (homebrew: https://homebrew.bintray.com/bottles/icu4c-55.1.yosemite.bottle.tar.gz), and reinstalled pandoc-citeproc (which in turn reinstalls text-icu-0.7.0.1) with the following:

pandoc-citeproc $ cabal install -ftest_citeproc -funicode_collation --extra-lib-dirs=/usr/local/opt/icu4c/lib --extra-include-dirs=/usr/local/opt/icu4c/include
Resolving dependencies...
In order, the following will be installed:
text-icu-0.7.0.1 (reinstall) changes: text-1.2.0.4 -> 1.1.0.0
pandoc-citeproc-0.7 +unicode_collation +test_citeproc (reinstall) changes:
aeson-pretty-0.7.2 added, attoparsec-0.11.3.4 added, process-1.2.0.0 added,
temporary-1.2.0.3 added, text-icu-0.7.0.1 added
Warning: Note that reinstalls are always dangerous. Continuing anyway...
Configuring text-icu-0.7.0.1...
Building text-icu-0.7.0.1...
Installed text-icu-0.7.0.1
Configuring pandoc-citeproc-0.7...
Building pandoc-citeproc-0.7...
Installed pandoc-citeproc-0.7
Updating documentation index
/Users/nick/Library/Haskell/share/doc/x86_64-osx-ghc-7.8.3/index.html

Output, as before:

Text
====

Foo (Author [1985](#ref-A); author [1985](#ref-a); de Doe
[1985](#ref-d); Zuthor [1985](#ref-Z); zuthor [1985](#ref-z)).

References {#references .unnumbered}
==========

Author, Al. 1985. *Title*.

Zuthor, Zoe. 1985. *Title*.

author, Al. 1985. *Title*.

de Doe, Dan. 1985. *Title*.

zuthor, Zoe. 1985. *Title*.

Anything else I could try?

Owner

jgm commented May 5, 2015

I'm at a loss. I've tried this with -funicode_collation on both Ubuntu Linux and OS X (with icu4c installed via Homebrew), and in both cases I get correct sorting. (I don't know if ICU's collation is locale-dependent. What is your locale?)

Contributor Author

njbart commented May 5, 2015

Excellent guess. Running locale told me:

LANG=
LC_COLLATE="C"
LC_CTYPE="UTF-8"
LC_MESSAGES="C"
LC_MONETARY="C"
LC_NUMERIC="C"
LC_TIME="C"
LC_ALL=

Running

export LC_ALL=en_US.UTF-8  
export LANG=en_US.UTF-8

resolves the sorting issue, also for accented and other modified characters for which icu4c was originally brought in (see https://code.google.com/p/citeproc-hs/issues/detail?id=64).

Also, export LC_ALL=da_DK.UTF-8; export LANG=da_DK.UTF-8 leads to the correct Danish sorting of “Ø” after “Z”, so this all looks ok.
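
A minimal Python sketch of why the "Actual" output looks the way it does, assuming that collation under the C locale amounts to plain code-point comparison (the thread does not confirm this for ICU). The case-folding key is only a rough stand-in for a locale-aware collation, not ICU's algorithm; the variable names are illustrative.

```python
# Family names from the test case in this issue.
names = ["Author", "author", "de Doe", "Zuthor", "zuthor"]

# Plain code-point comparison: every uppercase ASCII letter sorts before
# every lowercase one, so "Zuthor" lands ahead of "author".
codepoint_order = sorted(names)

# Rough stand-in for a case-insensitive collation (NOT ICU's algorithm):
# compare case-folded names first, then break ties with the original string.
folded_order = sorted(names, key=lambda s: (s.casefold(), s))

print(codepoint_order)
print(folded_order)
```

The first list reproduces the order reported under "Actual", the second the order listed under "Expected".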

My follow-up question is, could pandoc be patched to make icu4c use the locale matching the content of the lang variable, or, if this is empty, use en-US as a default?

Owner

jgm commented May 5, 2015

That should be possible. Currently we have

src/Text/CSL/Style.hs line 438:
      comp a b = T.collate (T.collator T.Current) (T.pack a) (T.pack b)

Current means: use the current locale. Instead of Current, we could use Locale "en-US" or any other fixed locale.

Owner

jgm commented May 6, 2015

OK, on closer look, it's not easy to make the collation depend on the locale specified in the style.
The problem is that the compare' function, and the Ord instance it is used to define, don't have a parameter for style or locale.
I might try leaving it as Current and setting the LANG environment variable from within pandoc-citeproc, depending on the style.

jgm closed this as completed in 1eedca6 May 6, 2015
Owner

jgm commented May 6, 2015

That seems to work well. The locale will be set by the locale metadata field, if present, and otherwise the style's own default-locale, or en-US if that isn't present either.
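
The fallback order described here can be sketched in Python as follows. This is only an illustration of the stated precedence, not pandoc-citeproc's actual code; the function name and parameters are hypothetical.

```python
from typing import Optional


def resolve_collation_locale(metadata_locale: Optional[str],
                             style_default_locale: Optional[str]) -> str:
    """Hypothetical helper: pick the locale used for sorting.

    Precedence as described in the comment above:
    1. the `locale` metadata field, if present,
    2. otherwise the style's own default-locale,
    3. otherwise en-US.
    """
    for candidate in (metadata_locale, style_default_locale):
        if candidate:  # skips both None and the empty string
            return candidate
    return "en-US"


print(resolve_collation_locale("da-DK", "de-DE"))
print(resolve_collation_locale(None, "de-DE"))
print(resolve_collation_locale(None, None))
```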

Contributor Author

njbart commented May 7, 2015

Well, mostly. Locale-specific collation is not set from the locale metadata field yet.
I get the Danish collation, “Ø” after “Z”, only after running export LC_ALL=da_DK.UTF-8 (interestingly enough, not with export LC_COLLATE="da_DK.UTF-8" alone).

Owner

jgm commented May 7, 2015

Really? It works for me:

---
locale: da_DK
references:
- issued:
  date-parts:
  - - 2005
  author:
  - given: John
    family: Øoe
  id: item1
  title: First book
  type: book
  publisher: Cambridge University Press
  publisher-place: Cambridge
- container-title: Journal of Generic Studies
  issued:
    date-parts:
    - - 2006
  author:
  - given: John
    family: Zoe
  id: item2
  title: Article
  type: article-journal
  volume: '6'
  page: '33-34'
...

@item1
@item2

With this I get Z before Ø.
When I comment out the locale line in the metadata, I get Ø before Z.
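
To illustrate how the same two names can sort differently depending on the locale, here is a toy Python sketch. The alphabets and folding table are assumptions made up for this example; real collation comes from ICU's locale data, which this does not reproduce.

```python
# Toy data (assumptions for illustration only, not ICU/CLDR rules).
# Danish alphabet: "æ", "ø", "å" are separate letters placed after "z".
DANISH_ALPHABET = "abcdefghijklmnopqrstuvwxyzæøå"
# English-style view: treat the same letters as variants of base letters.
ENGLISH_FOLD = {"ø": "o", "æ": "ae", "å": "a"}


def danish_key(name):
    """Sort key: position of each letter in the Danish alphabet."""
    return [DANISH_ALPHABET.index(ch) for ch in name.lower()
            if ch in DANISH_ALPHABET]


def english_key(name):
    """Sort key: fold special letters onto base letters, then compare."""
    return "".join(ENGLISH_FOLD.get(ch, ch) for ch in name.lower())


names = ["Øoe", "Zoe"]
danish_order = sorted(names, key=danish_key)
english_order = sorted(names, key=english_key)

print(danish_order)
print(english_order)
```

With the Danish key "Zoe" comes before "Øoe"; with the English-style key the order is reversed, which matches the two results reported above.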

Contributor Author

njbart commented May 7, 2015

Hmm, what's your output when you run the locale command? And shouldn't it be da-DK (with a hyphen) in your example?

Owner

jgm commented May 7, 2015

My locale is

LANG=en_US.UTF-8
LANGUAGE=
LC_CTYPE="en_US.UTF-8"
LC_NUMERIC="en_US.UTF-8"
LC_TIME="en_US.UTF-8"
LC_COLLATE="en_US.UTF-8"
LC_MONETARY="en_US.UTF-8"
LC_MESSAGES="en_US.UTF-8"
LC_PAPER="en_US.UTF-8"
LC_NAME="en_US.UTF-8"
LC_ADDRESS="en_US.UTF-8"
LC_TELEPHONE="en_US.UTF-8"
LC_MEASUREMENT="en_US.UTF-8"
LC_IDENTIFICATION="en_US.UTF-8"
LC_ALL=

It works with either da-DK or da_DK (pandoc-citeproc treats these the same).
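
One way to treat the two spellings alike is to normalize the separator before comparing. The Python sketch below is a guess at the general idea, not the way pandoc-citeproc does it; the function name is hypothetical, and it only handles plain language-region tags.

```python
def normalize_locale_tag(tag: str) -> str:
    """Hypothetical helper: map 'da_DK' and 'da-DK' to the same form.

    Only handles simple 'language' or 'language-REGION' tags; suffixes
    such as '.UTF-8' are out of scope for this sketch.
    """
    language, _, region = tag.replace("_", "-").partition("-")
    if region:
        return f"{language.lower()}-{region.upper()}"
    return language.lower()


print(normalize_locale_tag("da_DK"))
print(normalize_locale_tag("da-DK"))
```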

Contributor Author

njbart commented May 8, 2015

OK, this seems to be hinging on whether LC_ALL is set or not. If it is, LANG does not seem to override it. If I leave LC_ALL unset, your example works as expected.

Owner

jgm commented May 8, 2015

Aha. Currently I just set LANG (in the subprocess) according to the
value of the locale metadata, if set. Should I just set LC_ALL
instead? That should be safe, I think.

Contributor Author

njbart commented May 8, 2015

Yes, that seems better. In general, the most common approach seems to be to let users set LANG but not LC_ALL, and use the latter in scripts to override the former, if necessary. See, e.g., https://www.gnu.org/software/gettext/manual/html_node/Locale-Environment-Variables.html.
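
The precedence behind this can be sketched in Python. It follows the usual POSIX rule for the collation category (LC_ALL, then LC_COLLATE, then LANG); whether ICU consults LC_COLLATE in the same way is an assumption, and the earlier observation that setting LC_COLLATE alone had no effect suggests it may not. The function name is hypothetical.

```python
def effective_collate_locale(env: dict) -> str:
    """Hypothetical helper: which variable decides collation under POSIX rules.

    A variable that is unset or empty is skipped; if none is set,
    fall back to the "C" locale.
    """
    for name in ("LC_ALL", "LC_COLLATE", "LANG"):
        value = env.get(name)
        if value:
            return value
    return "C"


# LC_ALL wins, so changing LANG alone has no effect on collation.
print(effective_collate_locale({"LC_ALL": "C", "LANG": "da_DK.UTF-8"}))
# With LC_ALL unset or empty, LANG takes effect.
print(effective_collate_locale({"LC_ALL": "", "LANG": "da_DK.UTF-8"}))
```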

jgm added a commit that referenced this issue May 8, 2015
LC_ALL will override LANG, so if it is set, setting LANG
doesn't affect collation.  See #122.
Owner

jgm commented May 8, 2015

I've done this.
