# Releases
## Sorting Hat 0.7 - (2018-10-02)
**NOTICE: Database schemas generated by SortingHat < 0.7.0 are still
compatible, but older versions can have problems inserting 4-byte UTF-8
characters.
Python 2.7 is no longer supported.
Please check the "Compatibility between versions" section of the README.md file.**
** New features and improvements: **
* Python 2.7 no longer supported
As Python 2.x will not be maintained after 2020, SortingHat is only
compatible with Python >= 3.4.
* Low level API
This API is able to execute basic operations over the database, such
as adding or removing identities or finding entities. All these operations
work within a session. Nothing is stored in the database until the
session is closed. These functions can thus be considered "bricks"
that can be combined to create high-level functions.
* Storage of 4-byte UTF-8 characters
The default UTF-8 charset (`utf8`) in MySQL/MariaDB does not support
4-byte characters, even though they are part of the standard.
This means characters like emojis or certain Chinese characters cannot
be inserted. Identity names and usernames often contain these kinds of
characters.
The charset that fully supports UTF-8 is `utf8mb4` using the collation
`utf8mb4_unicode_520_ci`. This collation implements the suggested Unicode
Collation Algorithm (v5.2).
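Using `utf8mb4` also implies that the maximum length of indexed character
columns (VARCHAR and so on) is 191. With the 767-byte index key limit that
InnoDB applies in older MySQL/MariaDB configurations and up to 4 bytes per
character, indexes cannot be larger than that.
Starting with the 0.7 series, SortingHat uses this charset.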
* Handle disconnection using pessimistic mode
SQLAlchemy offers a pessimistic mode to handle database disconnection.
Setting the `pool_pre_ping` parameter on the database engine checks whether
a database connection is still alive each time it is checked out from the
connection pool. This causes a small performance hit, but it's worth it.
* Use an optimistic approach when inserting data
With this optimistic approach, no queries are run before an insertion to
check whether the entity already exists in the database.
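As an illustration of the optimistic approach, the sketch below inserts a row directly and relies on the uniqueness constraint of the database to report duplicates, instead of running a `SELECT` first. It uses SQLite from the Python standard library rather than MySQL, and the `organizations` table and the `add_organization` helper are hypothetical names chosen for this example, not SortingHat's own code.

```python
import sqlite3


def add_organization(conn, name):
    """Insert a row without checking for it first.

    Returns True when the row was inserted and False when the
    database rejected it as a duplicate. (Hypothetical helper.)
    """
    try:
        # The connection context manager commits on success and
        # rolls back if the statement raises.
        with conn:
            conn.execute("INSERT INTO organizations (name) VALUES (?)", (name,))
    except sqlite3.IntegrityError:
        return False
    return True


conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE organizations (name TEXT PRIMARY KEY)")

first = add_organization(conn, "Example")   # new row
second = add_organization(conn, "Example")  # duplicate, rejected by the database
```

A check-then-insert version would need two statements per row (a lookup, then the insert); here the duplicate case is handled only when it actually happens.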
## Sorting Hat 0.6 - (2018-03-05)
**NOTICE: Database schemas generated by SortingHat < 0.6.0 are no longer
compatible. Please check the "Compatibility between versions" section of the
README.md file**
** New features and improvements: **
* Gender.
The gender of a unique identity can be set in its profile using the
`profile` command; the data is stored in the table of the same name. This
table adds two new fields: `gender`, a free-text field for the gender
value, and `gender_acc`, the accuracy of the gender (in a range
of 1 to 100) when it is set using automatic options.
The new command `autogender` has also been added. It assigns a gender
to each unique identity using the name of the profile and the information
provided by `http://genderize.io`. Possible values are *male* or *female*.
* Option for reusing a database.
An existing database can be reused when the `init` command is called. Until
now, this command raised an exception when the database already existed.
* Version option.
Calling `sortinghat` with the option `-v | --version` prints the version
of `sortinghat` running on the system.
* Test improvements.
Some minor changes were made in the testing area. The main ones were to
support the MariaDB engine and to use a remote testing database.
## Sorting Hat 0.5 - (2017-12-21)
**NOTICE: Database schemas generated by SortingHat < 0.5.0 are no longer
compatible. Please check the "Compatibility between versions" section of the
README.md file**
** New features and improvements: **
* Last modification.
Unique identities and identities now record the last time they were
modified by any of these operations: adding, deleting, moving, merging,
updating the profile, and adding or removing enrollments.
The new `search_last_modified_identities` API function searches
for the UUIDs of those identities modified on or after a given date.
* No strict matching option.
This option skips the rigorous validation of values while matching
identities, for instance, requiring well-formed email addresses
or names made of a first name and a last name. This option is available
in the `load` and `unify` commands.
* Reset option while loading.
Before loading any data, if `reset` option is set, all the relationships
between identities and their enrollments will be removed from the
database.
* GrimoireLab support.
GrimoireLab identities and organizations YAML files can be converted
to Sorting Hat JSON format using the script `grimoirelab2sh`.
** Bugs fixed: **
* Fix tables created with invalid collation. In some random situations
Sorting Hat tables appear with an invalid collation. This is related
to a wrong generation of the DDL table statement by SQLAlchemy, which
may randomly prepend the collation information (`MYSQL_COLLATE`) to
the charset one (`MYSQL_CHARSET`), causing the former to be ignored.
Changing `MYSQL_CHARSET` to `MYSQL_DEFAULT_CHARSET` fixed the problem.
* Remove trailing whitespace in exported JSON files. This error is only
found in Python 2.7 due to a bug in the standard library with
`json.dump()` and `indent` parameter. (#103)
* Update profile information when loading identities. Until now, profile
information was set only the first time a unique identity was loaded.
With this change, it will always be updated, except when the given
profile is empty.
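For the trailing-whitespace item above, a common workaround on Python 2.7 is to pass explicit `separators` to `json.dump()` so that no space follows the item separator. The sketch below shows that workaround; it is not necessarily the change made in #103, and the `export_json` helper and the sample data are hypothetical. The example itself is written for Python 3, where the call is harmless.

```python
import io
import json


def export_json(data, stream):
    """Write indented JSON with no whitespace at the end of any line."""
    # On Python 2.7 the default item separator stays ', ' even when
    # `indent` is set, which leaves a space before each newline.
    # Passing ',' explicitly avoids that.
    json.dump(data, stream, indent=4, separators=(",", ": "), sort_keys=True)


buffer = io.StringIO()
export_json({"source": "scm", "uidentities": {"name": "John Smith"}}, buffer)
lines = buffer.getvalue().splitlines()
```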
## Sorting Hat 0.4 - (2017-07-17)
** New features and improvements: **
* Mailmap and StackAlytics support.
Mailmap and StackAlytics files can be converted to Sorting Hat JSON
format using the new scripts `mailmap2sh` and `stackalytics2sh`.
* Unify by sources.
Given a list of sources, this option makes the `unify` command
merge only those unique identities that belong to any of the given
sources.
** Bugs fixed: **
* Encoding error generating UUIDs in Python 3. Some special characters
cannot be encoded in Python 3. This caused the function `uuid()` to fail
when converting those characters. The `surrogateescape` error handler was
added to fix that problem.
* Force `utf8_unicode_ci` collation on MySQL tables to fix integrity errors.
MySQL considers characters like `β` and `b`, or `ı` and `i`, to be the same
when certain collations are set (e.g. `utf8_general_ci`). This can raise
integrity errors when Sorting Hat tries to add similar identities that
differ only in these pairs of characters.
For instance, if the identity:
('scm', 'βart', 'bart@example.com', 'bart')
is stored in the database, the insertion of:
('scm', 'bart', 'bart@example.com', 'bart')
will raise an error, even though these identities have different UUIDs.
Forcing MySQL to use `utf8_unicode_ci` fixes this error, allowing
both identities to be inserted.
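The `surrogateescape` handler mentioned in the first item of this list can be illustrated with the standard library alone. The sketch below assumes the problematic characters are lone surrogates produced when undecodable bytes are read into a Python 3 string; the sample bytes and the `encode_for_hashing` helper are hypothetical and not taken from SortingHat.

```python
def encode_for_hashing(text):
    """Encode text to bytes, mapping lone surrogates back to raw bytes."""
    return text.encode("utf-8", errors="surrogateescape")


# Latin-1 bytes for "Jöhn": 0xF6 is not valid UTF-8, so decoding with
# surrogateescape turns it into the lone surrogate U+DCF6.
raw = b"J\xf6hn"
name = raw.decode("utf-8", errors="surrogateescape")

# A strict encode rejects the lone surrogate.
try:
    name.encode("utf-8")
    strict_failed = False
except UnicodeEncodeError:
    strict_failed = True

# With the handler, the original bytes are recovered instead.
encoded = encode_for_hashing(name)
```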
## Sorting Hat 0.3 - (2017-03-21)
**NOTICE: UUIDs generated by SortingHat < 0.3.0 are no longer compatible.
Please check the "Compatibility between versions" section of the README.md file**
** New features and improvements: **
* New algorithm to generate UUIDs.
Previously, UUIDs were generated from case- and accent-sensitive values using
the seed `(source:email:name:username)`. This means that identities with the
same values in lower or upper case (e.g. `jsmith@example.com` and `JSMITH@example.com`)
or with the same values accented or unaccented (e.g. `John Smith` and `Jöhn Smith`)
had different UUIDs for each of these combinations.
The new algorithm converts upper case characters to lower case and accented
characters to their canonical form before the UUID is generated.
This change is motivated by how MySQL handles case and accented characters.
MySQL considers those characters the same,
raising `IntegrityError` exceptions when similar tuple values are inserted
into the database. Generating the same UUID for these cases prevents the
error.
Take into account that previous UUIDs are no longer compatible with this
version of SortingHat. You should regenerate the UUIDs following the steps
described in the section *Compatibility between versions* of the `README.md` file.
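The normalization described above can be sketched as follows. This is a minimal illustration, not SortingHat's actual `uuid()` code: the helper names `unaccent` and `sketch_uuid`, the use of SHA-1 as the hash, and the handling of missing values as empty strings are all assumptions made for the example.

```python
import hashlib
import unicodedata


def unaccent(text):
    """Strip accents by decomposing characters and dropping combining marks."""
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def sketch_uuid(source, email, name, username):
    """Hash the seed source:email:name:username after normalizing it."""
    values = (source, email, name, username)
    # Missing values become empty strings (an assumption of this sketch).
    seed = ":".join(unaccent(value or "").lower() for value in values)
    return hashlib.sha1(seed.encode("utf-8")).hexdigest()


plain = sketch_uuid("scm", "jsmith@example.com", "John Smith", "jsmith")
variant = sketch_uuid("scm", "JSMITH@example.com", "Jöhn Smith", "jsmith")
other = sketch_uuid("scm", "jdoe@example.com", "John Doe", "jdoe")
```

Because case and accents are removed before hashing, `plain` and `variant` are the same identifier, while `other` still differs.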
** Bugs fixed: **
* Any non-empty value in the email field was used during affiliation. This
caused errors for invalid email addresses such as 'email@',
which raised an `IndexError` exception. This bug has been fixed by using
only valid email addresses during affiliation.
* Invalid database names were allowed in `init` command.
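For the affiliation fix in the first item of this list, the idea of using only valid email addresses can be sketched as below. The pattern and the `email_domain` helper are hypothetical and deliberately simple; the validation rules SortingHat actually applies may differ.

```python
import re

# Hypothetical pattern: a local part, an '@', and a domain with at least one dot.
EMAIL_PATTERN = re.compile(r"^[^@\s]+@([^@\s]+\.[^@\s]+)$")


def email_domain(email):
    """Return the domain of a well-formed address, or None otherwise."""
    if not email:
        return None
    match = EMAIL_PATTERN.match(email)
    return match.group(1) if match else None


valid = email_domain("jsmith@example.com")
invalid = email_domain("email@")
```

Returning `None` for malformed input lets the caller skip the affiliation step instead of failing while extracting the domain.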
## Sorting Hat 0.2 - (2017-02-01)
** New features and improvements: **
* Auto complete profile information with `autoprofile` command.
This command autocompletes the profile information related to a set of unique
identities. To update the profile, the command uses a list of sources ordered
by priority. Only those unique identities which have one or more identities
from any of these sources will be updated. The name of the profile will be
filled using the best name possible, normally the longest one.
* GitHub identities matching method.
This new method tries to find equal identities using those identities from
GitHub sources. The identities must come from a source starting with a `github`
label and the usernames must be equal.
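The GitHub matching rule above can be sketched as a small predicate. The `Identity` tuple and the `github_match` function are hypothetical names for this example, and the comparison is an exact string match, which is an assumption about how "equal" is interpreted.

```python
from collections import namedtuple

# Hypothetical, minimal identity record for the example.
Identity = namedtuple("Identity", ["source", "username"])


def github_match(a, b):
    """True when both identities come from a github* source and share a username."""
    for identity in (a, b):
        if not identity.source or not identity.source.startswith("github"):
            return False
        if not identity.username:
            return False
    return a.username == b.username


same = github_match(Identity("github", "jsmith"), Identity("github:commits", "jsmith"))
wrong_source = github_match(Identity("scm", "jsmith"), Identity("github", "jsmith"))
different = github_match(Identity("github", "jsmith"), Identity("github", "jdoe"))
```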
** Bugs fixed: **
* The parser for Gitdm files only accepted email addresses as valid aliases.
This has been modified to accept any type of alias. Thus, the input file
passed to the `gitdm2sh` script will be a list of valid aliases instead of
email aliases.