Cardbox Talk

 

collating sequence

small bug in collating table

Charles Welling

25-Apr-2014 06:22

There's a small bug caused by the SPLIT line in collating tables that you should be aware of.
When the collating table contains a SPLIT line, the character that is used for the SPLIT is NOT indexed when it occurs preceded by, followed by or surrounded by spaces.
It happens to any character used for the SPLIT.
This may be the cause of inaccurate search results. In fact, that's how I found out.

Example for a SPLIT with a hyphen:

"test-case" is indexed (as it should be) as:
CASE
TEST
TEST-CASE

"Female suffrage 1900 - 1920" however is indexed as:
1900
1920
FEMALE
SUFFRAGE

A phrase search for "Female suffrage 1900 - 1920" results in 0 records.
When the SPLIT line is removed from the collating table, indexing is restored to its normal behaviour but, of course, without split words.
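
The behaviour described above can be modelled with a small Python sketch. It is a toy model, not Cardbox's actual indexing code: the names `index_terms` and `phrase_findable`, the `index_lone_split_char` switch, and the assumption that a phrase search needs every word of the phrase to be present in the index are all hypothetical and introduced only for illustration.

```python
def index_terms(text, split_char="-", index_lone_split_char=False):
    """Toy model of SPLIT indexing (hypothetical, not Cardbox's algorithm)."""
    terms = set()
    for word in text.upper().split():
        if word == split_char:
            # The split character stands alone between spaces.
            # The reported behaviour corresponds to skipping it here.
            if index_lone_split_char:
                terms.add(word)
            continue
        terms.add(word)  # the whole word, split character included
        if split_char in word:
            # Also index the parts on either side of the split character.
            terms.update(part for part in word.split(split_char) if part)
    return sorted(terms)


def phrase_findable(phrase, index):
    """Assumption: a phrase can only match if each of its words is indexed."""
    return all(word in index for word in phrase.upper().split())


title = "Female suffrage 1900 - 1920"
reported = index_terms(title)                              # lone "-" dropped
expected = index_terms(title, index_lone_split_char=True)  # lone "-" kept

print(index_terms("test-case"))
print(reported, phrase_findable(title, reported))
print(expected, phrase_findable(title, expected))
```

With the switch off, the model reproduces the reported index (no "-" entry) and the phrase is not findable; with the switch on, the lone hyphen is indexed and the phrase can match.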

bert

25-Apr-2014 09:40

If you change HYPHEN to SPLIT in Contacts.fil, Cardbox rejects it.
Entering SPLIT-
instead of
HYPHEN -

generates an error in Sample.fil:

Load Collating Sequence from File
---------------------------
Error in line 46: only single characters may be flagged as SPLIT.
---------------------------
OK
---------------------------

(Line 46 is the last line of the file: (==00DF(ß) S S) and has nothing to do with SPLIT.)
**This is a bug**.

I could solve this error by adding "-" as an indexed character.

My approach to this kind of thing:
I could never explain to anyone why Cardbox Noord-Holland indexed as Noord as well as Holland as well as NoordHolland.
So I removed HYPHEN as well as SPLIT as soon as possible and added "-" as an indexed character.
It is easy to explain that if you type a search term, you will find what you typed. So searching for Noord-Holland will not give you "Noord Holland".
Cardbox is not Google ;-). Also easy to explain.

My workaround for the "bug" is changing this line:
32: _ 0020 -
and removing HYPHEN or SPLIT.
If you do this, you can find "Female suffrage 1900 - 1920", and you can easily explain that words are not indexed any differently from how they were typed.

You write:
"Female suffrage 1900 - 1920" however is indexed as:
1900
1920
FEMALE
SUFFRAGE
That is the correct way. The "-" was not surrounded by words, but by spaces. SPLIT has to index "-" only if it is enclosed by words.

However, when "-" is added to the indexed characters, SPLIT prevents it from being indexed. **That is also a bug.**

It would have been nice to try "Female suffrage 1900-1920". The "-" is used for negative numbers, as a hyphen, in an index and as a separator in a date...
Impossible things for Cardbox. Which database can handle this without limiting the field and search properties of a field?

Btw: I always make the SKIP and DELETE lines a little longer. This prevents numbers from being indexed as words when a comma, quotes, a € or £ etc. comes before the number. Forcing numbers into the word index is not a nice feature, but a pitfall.

Regards
bert

Charles Welling

25-Apr-2014 13:09

First of all, when you use
SPLIT -
then the hyphen MUST be an indexable character. It's in the manual and anyone could have thought of that. When characters are part of a word, then they must be indexed.
The point is that any character can be indexed on its own, but not when it's in the SPLIT line. Try an "A" as the SPLIT character and "A" will no longer be indexed.

****
"Female suffrage 1900 - 1920" however is indexed as:
1900
1920
FEMALE
SUFFRAGE
That is the correct way. The "-" was not surrounded by words, but by spaces. SPLIT has to index "-" only if it is enclosed by words.
****
No Bert, that's NOT the correct way. Why not? There's nothing to split here: SPLIT works on concatenated terms. Here "-" is a separate single-character word and it should be indexed as such. As soon as SPLIT is removed, Cardbox will index the "-", so somehow SPLIT alters this behaviour. It shouldn't.
Note that Cardbox sees the hyphen as a valid character; it does not generate an error when it is typed into a search string. So you would assume you can search for it. But you can't.

This is what the index should look like when a database contains the title "Trafford Leigh-Mallory and the RAF 1940 - 1943."

with a SPLIT:
-
1940
1943
AND
LEIGH
LEIGH-MALLORY
MALLORY
RAF
THE
TRAFFORD

without a SPLIT:
-
1940
1943
AND
LEIGH-MALLORY
RAF
THE
TRAFFORD

The separate "-" being the hyphen between the years, NOT the hyphen between Leigh and Mallory.

bert

25-Apr-2014 15:15

Noord-Holland is indexed by default (already in the DOS version) as NOORD, HOLLAND and NOORDHOLLAND.
Noord(space)-(space)Holland has been indexed by default since 1987 or so as NOORD and HOLLAND. They are separate terms. The "-" is not indexed, of course: without numbers around it, it is seen as punctuation.

Help tells us that SPLIT works like HYPHEN. The only difference: the SPLIT character is left between the terms.
So SPLIT indexes to NOORD, HOLLAND and NOORD-HOLLAND.

That is how I understand the Help.
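
This reading of the Help can be sketched in a few lines of Python. It is only an illustration of the difference as described in this thread, not Cardbox's implementation; the function name `expand` and its `mode` parameter are hypothetical.

```python
def expand(word, char="-", mode="HYPHEN"):
    """Index terms for one word under HYPHEN or SPLIT (illustrative only)."""
    word = word.upper()
    parts = [part for part in word.split(char) if part]
    if len(parts) < 2:
        # Nothing to split: the word is indexed as it stands.
        return [word]
    # HYPHEN discards the character in the whole word, SPLIT keeps it.
    joiner = char if mode == "SPLIT" else ""
    return sorted(set(parts + [joiner.join(parts)]))


print(expand("Noord-Holland", mode="HYPHEN"))
print(expand("Noord-Holland", mode="SPLIT"))
```

In both modes the two halves are indexed; only the form of the whole word differs (NOORDHOLLAND versus NOORD-HOLLAND).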

I repeat:
- Changing HYPHEN to SPLIT is not possible because of a bug. You cannot read the changed col file.
- Adding "-" as an indexed character to the col file should be a workaround for this. However, when SPLIT or HYPHEN is used, "-" is never indexed although it has been added as an indexed character. That is a bug.
- When SPLIT and HYPHEN are left out and "-" is added to the col file, "-" is indexed.
Regards
Bert

Charles Welling

25-Apr-2014 18:47

I've never experienced any trouble with reading a collating table, either with or without SPLIT or HYPHEN, so it's unlikely that that's a bug in Cardbox.

Adding the SPLIT character to the indexed characters is necessary, not a workaround. I quote from the help, where there's an example of a SPLIT with a full stop (.):
"Note that the second line in the definition really is needed. It tells Cardbox that the full-stop character is indexable."

And not indexing the SPLIT character although it has been added to the indexable characters must indeed be a bug. That's what I've been saying the whole time.

bert

25-Apr-2014 20:25

Pasted from help:
"It (SPLIT) splits the word in exactly the same way as HYPHEN, but then, when it indexes the whole word, it does not discard the character that caused the split."
That all tells me that the SPLIT character will be indexed in SPLIT situations.

However, in another part of the Help, on another page, this is indeed suddenly called "inclusive" splitting. It adds that a second line is needed, which is not mentioned in the main description of SPLIT. A pitfall.
When the second line is not added, the col-read problem occurs.
That last part tells me that the SPLIT character will be indexed in SPLIT situations as well as when it stands on its own.

And I agree, that does not work right.

It was all a little confusing because of your example "1900 - 1920". It seemed that you expected SPLIT to do something in that situation, while SPLIT must do nothing here. Only "-" had to be indexed.

Nice weekend,
Bert

Charles Welling

5-Jun-2014 08:22

Hi Bert,
Remembering your remark:

"I could never explain to anyone why Cardbox Noord-Holland indexed as Noord as well as Holland as well as NoordHolland."

I came across a useful application of the HYPHEN command. Keywords may be the containers of other keywords, e.g. the word "sweetshop" contains the word "shop". An index search on "shop" would have no results. But, when you use HYPHEN with, for instance, the middle dot (·), you can have both words indexed.

sweet·shop is indexed as
SHOP
SWEET
SWEETSHOP

When you use proportional fonts, the middle dot is almost invisible and doesn't disturb the layout. On the Internet, using ASP or a single line of JavaScript, you can remove the middle dot (or any other inconspicuous character) entirely from the output.
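
The post mentions ASP or JavaScript for this; the same idea is shown here in Python purely as a sketch. The function name `strip_marker` is hypothetical, and the marker is assumed to be the middle dot (U+00B7).

```python
MIDDLE_DOT = "\u00b7"  # the "·" character used as the HYPHEN marker


def strip_marker(text, marker=MIDDLE_DOT):
    """Remove the invisible split marker before showing text to the user."""
    return text.replace(marker, "")


print(strip_marker("sweet\u00b7shop"))
```

The stored field keeps the marker, so the index still contains SHOP, SWEET and SWEETSHOP, while the displayed text reads "sweetshop".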

bert

5-Jun-2014 09:55

You are right: this also offers nice possibilities. However, these tricks are fine for IT people who can implement that kind of thing (and many other Cardbox tricks) to provide a plain, simple, easy-to-explain interface for front-end Cardbox users.
However, if the dot is forgotten in a new record, the trick no longer works.

So my principle is "what you write, you will find". Can it be any easier?
And I train Cardbox users to use wildcards in the right place, at the right time.

Charles Welling

24-Nov-2014 09:06

I just ran into another anomaly when using hyphens, and it's a serious one.
My database indexes hyphens, as they can be part of a title. I don't use SPLIT anymore, because that causes hyphens NOT to be indexed. That's the above-mentioned bug.

Now try this: use a hyphen as the first character of a word, without quotes, such as:

-thisisabsolutenonsense

Cardbox will return (almost) the entire database in the search result. A preview is empty (it should be!), but Cardbox seems to find a lot of records anyway. And of course they do not match the search.

I've modified the search form in the Internet version to deal with this, but when you use the Client you can't. There's no error message, just a hugely wrong search result.
I'm very curious to know what happens here!

bert

24-Nov-2014 09:46

I think you are overlooking Boolean search.
If you remove SPLIT and do not index "-", Boolean search still works.
Indeed, if you search only for -thisisabsolutenonsense, you exclude thisisabsolutenonsense. No bug, but a perfect search!
If you index the hyphen and want to search for -thisisabsolutenonsense, you have to search between quotes to avoid the Boolean search. Just a little like Google...
Regards
bert
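
The explanation above can be illustrated with a toy query parser in Python. This is a sketch under stated assumptions, not Cardbox's search syntax: the names `parse_query` and `search`, the rule that a leading "-" outside quotes means "exclude", and the word-by-word matching are all hypothetical simplifications.

```python
import re

# A token is either a quoted string or a run of non-space characters.
TOKEN = re.compile(r'"([^"]*)"|(\S+)')


def parse_query(query):
    """Split a query into terms to include and terms to exclude."""
    include, exclude = [], []
    for match in TOKEN.finditer(query):
        quoted, bare = match.group(1), match.group(2)
        if quoted is not None:
            include.append(quoted.upper())   # quotes switch off Boolean "-"
        elif bare.startswith("-") and len(bare) > 1:
            exclude.append(bare[1:].upper())  # leading "-" means NOT
        else:
            include.append(bare.upper())
    return include, exclude


def search(records, query):
    """Return records containing all included and no excluded terms."""
    include, exclude = parse_query(query)
    hits = []
    for record in records:
        words = set(record.upper().split())
        if all(t in words for t in include) and not any(t in words for t in exclude):
            hits.append(record)
    return hits


records = ["Female suffrage 1900 - 1920", "Trafford Leigh-Mallory and the RAF"]
print(search(records, "-thisisabsolutenonsense"))    # excludes a word no record has
print(search(records, '"-thisisabsolutenonsense"'))  # literal search, nothing matches
```

In this model the unquoted query excludes a word that occurs nowhere, so every record is returned, which matches the "(almost) the entire database" result reported above; the quoted query looks for the literal term and finds nothing.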

Charles Welling

24-Nov-2014 12:35

Indeed, Bert, I overlooked the Boolean function of the hyphen. That is the trouble with using archaic things like "+" or "-" in a modern system. And then there's using the hyphen in dates and the same character being used as a minus sign in numbers. And combinations thereof. Who needs to use them as Booleans on top of all that?

bert

24-Nov-2014 13:19

I use boolean search every day.

© 2010 Cardbox Software Limited