Duplicate records on a Search

Searching on an indexed term and having duplicate records


PeterG

4-Jul-2012 12:40

When I do a search on a single indexed term I would expect only records with that term to be selected/filtered. However, a search will often produce multiple duplicates of the filtered records. Any suggestions as to how I can prevent this? Thanks.
Peter.

Jelly Belly

4-Jul-2012 13:41

Firstly I would turn on "Highlight Matches" under VIEW in the toolbar to see why you are getting these results. Then give us some idea of how you are searching and the terms you are using.

bert

4-Jul-2012 13:53

If you search for a term, and Cardbox finds n records with that term, then you can be sure there are n records in your database containing the searched term.
If there are identical records, then a record was probably edited and then saved with "Save As New", which makes a copy of the existing record.

In that case you will of course get duplicate records when searching: you added them yourself!

To clean up duplicate records, please use Tools, Deduplicate.
It is an amazing tool for cleaning up duplicate records.
regards
bert

Charles Welling

5-Jul-2012 08:02

I'd like to stress that Bert is right. You can NEVER find duplicate records if there are not duplicate records present in your database. It makes no difference what you search or how you search; a single record cannot be shown twice unless it has been entered twice.

There are two options to prevent the use of "save as new" or "duplicate record", which obviously is another cause of duplicate records.

Edit the native format, choose Tools > Toolbar > Editing Records.

1. Choose the toolbar button for "Save As New" and change its action to "do nothing".

2. Alternatively, you can assign a macro to the toolbar button. This is the better option, as you can make the macro issue a warning and offer you the option to save the record in the normal way. Just "do nothing" may be confusing.

Changing the behaviour of the Save and Save As New buttons will also change the behaviour of the corresponding menu items. This is an undocumented feature.

The above will make sure that you will never use Save As New again.

The usual way to use Duplicate Record is to press CTRL-D. Again, edit the native format and use Tools > Keyboard and add a new keystroke. In this case add CTRL-D which will override the built-in action of CTRL-D. Make it either "do nothing" or trigger another macro.
Adding CTRL-D as a keystroke does NOT change the behaviour of the menu item, so it will still be possible to pick "Duplicate Record" from the menu.

There is a way to completely prevent the use of Duplicate Records, but explaining that would make this post a bit long and perhaps complicated. Please let me know if you want to know how to do that.

If you need help with writing the above mentioned (very simple) macros, please add a post to the macros section.

PeterG

8-Jul-2012 11:51

To Bert, Charles and Jelly,

Thanks for taking the trouble to reply to my problem. Perhaps I should give a little more detail, as the suggestions don't seem to have fixed the problem I am having (it's almost certainly me, not Cardbox!). As I'm editing a journal I've set up the data base so that there are a number of fields I can index and search on. These include a unique number for the manuscript (entered manually), the author's name, the Decision and the title. If I then want to check that a new author has taken account of material we have previously published I first do a search on the Accept field. This reduces the number of records to about 10,000. If I then search on the relevant keywords in the Title field this throws up another search level. What I would then expect is a set of records containing the keywords but not duplicated (I save revised records as 'Save', never through 'Save as New'). However, what I am presented with are the records I want but also duplicates of titles rather than a unique set of titles.

If I now click on 'Deduplicate' and identify the indexed field 'Title', I get the message 'There are no records in the current selection.'

So I am left having to check records manually to ensure that I haven't accidentally referred to the same record twice (even though it seems to appear only once in the data base as it is linked to a unique number).

I hope that's clear? I quite see that a single record cannot be shown twice unless it's been entered twice, but I don't seem to be entering it twice, yet it's showing up more than once. I inherited this data base from the previous editor, so it's quite likely that there's something going on in the background that I'm unaware of!

Thanks again for your suggestions.

Peter.

bert

8-Jul-2012 13:53

You wrote:
<..>
"I first do a search on the Accept field. This reduces the number of records to about 10,000. If I then search on the relevant keywords in the Title field this throws up another search level."
<..>

You get the records you selected with your search commands.
E.g.: you make a selection resulting in a set of 100 records, and every record has "Wall Street Journal" in the Publisher field. When you print a list or the like, you get "Wall Street Journal" 100 times in that list.

If that is what you call "duplicates" (and as far as I understand, that is what you are trying to explain?), then these are not duplicates. It is simply a collection of 100 records containing the same information in one field (in this example).

If you want to eliminate the 100x Wall Street Journal in reports, the best you can do is develop a script for this (as in many other databases).
Do not confuse a search result (= records!) with a search report (= a manipulation of a search result).

But perhaps I do not yet understand precisely what you mean. In Cardbox there are no things going on in the background. You enter search terms, you get what you asked for.

Charles Welling

8-Jul-2012 14:54

Peter, it's still a mystery to me.
You say you search for keywords in the field Title. Are these titles all different? For instance:

"Birds of Great Britain."
"Eagles are large birds."
"Let me tell you about the birds and the bees."

A search for "birds" would return these three titles, but these titles are different. No question of duplicates.
If there were two different articles with the same title, you have that title twice, but that would be correct.

Could you perhaps give us an example of what you call duplicate records? Two will do.

PeterG

8-Jul-2012 18:26

Dear Bert and Charles,

By 'duplicates' I mean that exactly the same record appears more than once, even though it exists on the data base just once. So to use Charles' example:

If I select in the Decision field the indexed term 'Accept' this filters out all but the records that have been accepted (and the display identifies that we have moved from Level 0 to Level 1).
I then move to the title field and insert a keyword such as 'birds' in order to identify a unique record with the title 'Birds of Great Britain' (in Charles' example) - and the display records that we have moved from Level 1 to Level 2.
What then comes up, of course, is the record with that title BUT that same record will appear a number of times (with any other records that have similar words, for example, 'Birds of Great Britain and Ireland'). It's the repetition of exactly the same record (with the identical title written by the same person with the unique identifying number) that I'm trying to prevent.

Frustrating!

Peter.

bert

8-Jul-2012 19:48

If you want to select "Birds of Great Britain"
and not
"Birds of Great Britain and Ireland"
or
"Birds of Great Britain and Holland"
then you can select it with a data search with the option "entire field as one unit" switched on.

If data search is too slow, then you can first select "Birds of Great Britain" by index search, and then perform the data search.
Of course, you get the right result in this case only if there is nothing other than "Birds of Great Britain" in that field.

The same record is never presented more than once. Cardbox keeps it in your selection at any level if your searches told it to. So I do understand why this is frustrating. I am glad of this feature!

bert

8-Jul-2012 20:33

(so I do NOT understand...)

Charles Welling

9-Jul-2012 06:59

Please try this:
Make a selection and look for any unique ID that comes up more than once.
E.g. ID "ABC123" appears 4 times.
Go to level 0 and select "ABC123".

Do you still have 4 records?

To be on the safe side, export the contents of these records to a file. Use the internal format.

Delete one of the apparent duplicates. Then make the original selection once more.
How many of "ABC123" do you have now?
If you have three now, you may safely assume that there were actually duplicates in your database. Delete two more and see if the result is 1.

After this, you might try another deduplication, but use the unique ID for deduplication, not any keywords. Don't forget to sort the database on the field "ID" before you start the deduplication.
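The check described above (look for any unique ID that comes up more than once) can also be run outside Cardbox on exported data. The following is only a minimal Python sketch, not a Cardbox macro: it assumes the records have already been read into a list of dictionaries, and the field name "ID" and the sample records are made up for illustration.

```python
from collections import Counter


def repeated_ids(records, id_field="ID"):
    """Return {id: count} for every ID that occurs more than once."""
    counts = Counter(rec[id_field] for rec in records)
    return {rec_id: n for rec_id, n in counts.items() if n > 1}


# Hypothetical sample data: "ABC123" was entered twice.
records = [
    {"ID": "ABC123", "Title": "Birds of Great Britain"},
    {"ID": "XYZ789", "Title": "Eagles are large birds."},
    {"ID": "ABC123", "Title": "Birds of Great Britain"},
]

print(repeated_ids(records))
```

An empty result would mean every ID occurs exactly once, so any apparent duplicates would have to come from somewhere other than the data itself.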

PeterG

10-Jul-2012 09:28

Thanks for the suggestion. I did as you suggested (did a search on a term which produced at level 1 an exact copy of the record - i.e. there were two identical records at level 1). I then deleted one of these and returned to level 0, but the record was missing (I reloaded it by clicking 'undo deletion'), so it looks as if there are no actual duplicates at level 0. And yet when I do a search they appear.

Peter.

bert

10-Jul-2012 10:15

Do an Extreme Rebuild of your database!

Charles Welling

10-Jul-2012 11:15

It seems as if we could go on searching for duplicates indefinitely, but at this stage I would advise you to cure this the hard way: pull the plug.

Here we go:

Be at level 0 and export your database to a file using the internal format (*.dmp).
Rebuild your format file, just to clean it up.
Close the database and Cardbox.
Rename the FIL to something like "MyDatabase.fil.old". Don't delete it, just change its name so you can restore it in two seconds, if necessary.
What you have now is a rebuilt format file, no database and no trouble.
Start Cardbox and tell it to make a new database. Cardbox will show you the name of the format file. Click OK and Cardbox will create a new and empty database, based on the original FMT.
Read the records from the *.dmp (internal format) file.

This is an entirely safe procedure as you will keep a copy of the old database.
If level 0 did not show any duplicates, the exported records in the *.dmp will not contain any.
You will have a brand-new, clean and shiny database.

PeterG

11-Jul-2012 08:25

I was worried that you might suggest this eventually! Thanks for such clear instructions though. I've exported parts of the data base before (I have it on 2 PCs, so need to keep the second system up to date with the master one) without any trouble, but last night when trying to export the whole data base it kept freezing about half way through with an error message (I've the details on its logged report but won't bother you with them). Today I'll export it in blocks and then carry on with your suggestions and let you know what develops.

Thanks again for your trouble.

Peter.

bert

11-Jul-2012 09:33

Use Extreme Rebuild - there you can choose to leave your damaged records blank. That makes it easy to find the damaged positions later.
Your current db is renamed to databasename.fil.old. No need to write in blocks. Everything is repaired in one batch.
Why make it difficult - use the function that is made for that!
Regards
Bert

Charles Welling

11-Jul-2012 09:43

The freezing may indicate that there is something seriously wrong with your database. Exporting it in blocks may or may not work, but there's no harm in that.

If I were you, I'd do another export in the external format (*.ext). It will give you a file which you can read with Notepad, Wordpad or Word and it will enable you to retrieve any data by copying it from this file.

When you do so, you'll have two files (*.dmp and *.ext). The DMP is the best file to use for an import, as it contains all the indexing information and even images if you have them. The EXT will contain your data in a form that can be easily read by you.

Then follow Bert's advice to rebuild your database, which is something you should do on a regular basis anyway. Try your two exports first, then rebuild your database and see if Cardbox reports any corrupted records. Tell Cardbox to keep corrupted records as blank records. The place where any blank records occur will give you an indication of which records were deleted. You may retrieve them from the external export file.

After having rebuilt your database, do another export, as above. Keep the export files separate.
Then proceed with building a new database:

If the rebuilding process did not report any errors and the second export went smoothly, then I'd use the internal file from the second export to import your records.
In the highly unlikely event that the rebuilding process messed up your entire database you may use the first internal file.

If the internal files for some reason cannot be read: use the external files. Any errors in these files can be corrected by hand.

And last but not least: you said you have your database on two PCs. If these PCs are connected, please consider installing the Cardbox server. It will save you the trouble of keeping both versions up-to-date. There's always a risk in keeping two versions that have to be synchronised.

PeterG

11-Jul-2012 11:34

Dear Bert and Charles,

It wasn't allowing me even to export a single file, let alone blocks, but I did as Bert has now suggested and the Extreme Rebuild of the data base produced no corrupted records and the system reported no problems, but interestingly it then did allow me to export the data to a .dmp file. Unfortunately it then repeats its 'Cardbox has encountered a fatal error and has had to close' message. So Charles is obviously right that there is something badly wrong, but perhaps it is with the format file?

However, I was then able to continue following Charles' advice to restart Cardbox and create a new database by first loading the renamed format file and then loading the .dmp file. At this point the reloading of the .dmp failed and Cardbox repeated its error message.

I don't know if it's of any relevance but I run two databases in Cardbox for the journal. Database 1 captures information from authors, and this is the one causing the problems I've been sharing with you. Database 2 captures information about the academics who review and report on the authors' papers. This second database is rock solid, with no 'duplications', error messages, problems with exporting data etc. Obviously it has a different format file to the one causing problems.

Bearing in mind that I have a backup (in fact more than one!) on the master PC and having been working on the second PC's copy of Cardbox I wonder if, rather than do as Charles now suggests, I should delete the whole of Cardbox from my second system, reload it and then copy the format file from the first system and its data base? Pretty drastic! The only problem I can see with this approach is that by copying the format file and data base from PC 1 to PC 2 I'll still have my problem of 'duplicates' in levels 1, 2 and so on.

I quite take your point, Charles, about having the same database on two systems. PC1 is the master which the administrator works on, PC2 the slave in that I often have to access the data base when I'm away from the office (to identify reviewers, check titles, etc.), but I never input data into the slave, just read it. We update the slave every week, sometimes more often.

Regards,

Peter.

Charles Welling

11-Jul-2012 12:29

There are still some options.

Did you export to a file in the external format? If so, you could browse through this file and look for errors. It will show you the records in much the same way as in Cardbox itself and any corruptions will be clearly visible.

You could also use this file to reload your database.

A second option is to create a new format file.

If you are going to read an external file (*.ext), then you should use the same field names as in the original database. An *.ext file contains the field names and Cardbox will read the *.ext and match the field names to the corresponding fields in your new format file.

If you want to read your *.dmp, you should make a new format file and give it some extra fields. Don't give the fields any meaningful names yet.
The *.dmp does not contain field names and Cardbox will just enter the field #1 from the *.dmp into field #1 in your new format file. But, if your predecessor ever deleted some fields from the database, there may be a field #8, even if your database only held 6 fields when you dumped it. That's why you should enter some extra fields.

After

Charles Welling

11-Jul-2012 12:33

Continued....

After you've read the *.dmp successfully (I hope), have a look at the data. E.g. field NONAME1 may contain your unique ID. Rename the fields to whatever names they should have and delete any unused (empty) fields. This is pretty drastic too, but it will eliminate any corruption from the old files.

Restoring your database from another PC is also an option, but don't forget that whatever went wrong, may already be present in those copies.

bert

11-Jul-2012 14:11

Did you also try "Rebuild Format"?
That seems to be an option here. Also: the old fmt is renamed by Cardbox to databasename.fmt.old
And just to be sure: are you running Cardbox 3.1?
regards

PeterG

18-Jul-2012 11:19

Dear Bert and Charles,

Sorry for the delay in getting back to you (have been away working).

Bert, I'm running Cardbox version 3.0 Professional Edition.

I thought I'd start again from the beginning, so checked that I was still getting 'duplicates' at Level 1 and following (which I am). I then exported the database again as a *.dmp file (which generates a Fatal Error message and shuts down Cardbox - do you think this error occurs when Cardbox meets a corrupted record, as it hasn't happened before when I update the slave system?), renamed the .fil and then restarted Cardbox. When I clicked File/New Database I wasn't shown the database file, but had to hunt for it and double-click it to open it. Everything loaded OK and when I ran a search the same 'duplication' problem occurred.

I ran Rebuild and also the Extreme Rebuild on the database, but no damaged records were identified.

I then exported in external format and read the files in Notepad. None appear to be corrupted, but to my surprise there are a number that seem to be duplicated or even triplicated. So I went back to the master database (the one on my master PC) and, although I've done nothing on it recently, there appear to be four sets of records (they are not an exact copy of the data base, but some appear more than once). Yet when I first contacted you I'd run a check for duplicate records and there were none on level 0 (unless I did a search and moved up levels).

So what I'm proposing to do now is:
 to export the whole data base and the original .fil from the master to the second PC...
 and working from Level 0 delete the duplicates and triplicates (it will be time consuming, but I'll do this one record at a time)...
 I'll then run a search and see what happens...
 then I'll break the rule, save and then delete all the data from the master data base...
 and load what should be a clean database onto it from the slave.

What do you think?

By the way, when I've moved records from the master to the slave PC (say record 4000 to the end of the data base) I tag record 4000 and then click Export (via .dmp). I NEVER move from the slave to the master database so I am really puzzled as to why there should be duplicates now at Level 0 on the Master database.

With regards,

Peter.

bert

18-Jul-2012 13:46

I really am curious about all this.
If you see duplicates, I wonder why Deduplicate cannot find them.

If you mail me at 001meworldmail.nl, I would really like to help try to solve your problem (and find what caused it), for free.
Replace "me" in the address with an @.
regards
bert

Charles Welling

18-Jul-2012 13:47

Well, this is quite a mess. The good part is that there's no mystery: there are duplicates as we suspected all along.
The fact that you didn't see duplicates on Level 0, but you did on Level 1 or 2, makes sense. If for instance a record in position 10 has a duplicate in position 500, you will never see them together unless you make a selection which eliminates the records that lie in-between. It's obvious: as these records are duplicates, they will show up together when you search for data they both contain.

I'd leave both the master and the slave database alone. Make a new copy of the master and clean that copy up. I'm convinced you can use Deduplicate if you use it the right way. That means that you will have to sort your database in such a way that the duplicates will be adjacent. Then use the Deduplicate command and tell Cardbox what to look for. Read the manual if you need to.

If this works, rebuild the clean copy because many records will have been deleted. Check it, check it once more and copy it back to the original master/slave. Keep a backup of those for a while just to make sure.

And try to figure out where these duplicates come from. Have a look at the format file. Your predecessor may have changed the keystroke CTRL-S to execute the command "Save As New" instead of "Save". That's just one of the possibilities.
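The two points made in this post (a selection brings distant duplicates together, and sorting makes them adjacent so that a deduplication pass can find them) can be illustrated with a small model. This is only a Python sketch under stated assumptions, not Cardbox's own Deduplicate command: the functions `select` and `dedupe_adjacent`, the field names and the sample records are all made up for illustration.

```python
def select(records, field, word):
    """Model of a search: keep records whose field contains the word."""
    return [r for r in records if word.lower() in r[field].lower()]


def dedupe_adjacent(records, key):
    """Sort on key, then keep only the first record of each run of equal keys."""
    kept = []
    for rec in sorted(records, key=lambda r: r[key]):
        if kept and kept[-1][key] == rec[key]:
            continue  # same key as the previous record: treat as a duplicate
        kept.append(rec)
    return kept


# Hypothetical data: record 101 was entered twice, far apart in the file.
database = [
    {"ID": 101, "Title": "Birds of Great Britain"},
    {"ID": 102, "Title": "Soil chemistry"},
    {"ID": 103, "Title": "River flow models"},
    {"ID": 101, "Title": "Birds of Great Britain"},
]

level_1 = select(database, "Title", "birds")
print([r["ID"] for r in level_1])          # the two copies now sit side by side
clean = dedupe_adjacent(database, "ID")
print([r["ID"] for r in clean])            # one record per ID
```

At "Level 0" the two copies are separated by unrelated records and easy to miss; after the selection they are neighbours. Sorting on the unique ID rather than on title keywords keeps two genuinely different articles with the same title from being treated as duplicates.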

PeterG

19-Jul-2012 12:41

Dear Charles and Bert,

Many thanks for working through this issue with me. I've now a clean data base (and in fact at my wife's suggestion have archived the full data base, but now keep current only records starting from the year 2000, a much smaller total of 1,500). My only worry now is what on earth created this problem in the first instance. I'll have a good look at the format file as you suggest, but my first check seems to show that nothing has been altered.

Bert, it is very good of you to offer to have a look at the data base. However, much of the material is confidential, so I cannot let you see the data. However, I could create a dummy data record or two and let you have that and its format file if you like.

Many thanks again for your time and trouble.

With regards,

Peter.

PeterG

14-Sep-2012 16:58

Me again, I'm afraid. The good news is that I have a clean data base on one of my two PCs. A further bit of good news is that I have identified how the duplicates were appearing on my slave PC. What happens is that I work on the data base on the master PC (let's say 20 records are updated, with a data base of 2,000) and so then I need to update the slave's data base.

I do this by deleting on the slave all records from 1,789 (to avoid duplication). I then move to the master data base, tag record 1,800, and then expect to be able to export in internal format 20 records. What in fact happens is the whole data base of 2,000 records is saved for exporting! THIS is the reason for the cursed duplicates I had problems with before.

So why is Cardbox not allowing me to save and then export only the 20 records I've updated?

Thanks,

Peter.

Charles Welling

14-Sep-2012 19:06

I think that you forget to take the last necessary step, which is to select the tagged records. When you tag all records from 1800 up, your selection will still contain ALL records until you use Select > Tags.

Repeat your procedure and take a look at the number of records in your selection. Is this 20 or 2000?
If I may make an educated guess I'd say that the screen will read "Level 0: 2000 records" and "20 records tagged".

But as the Cardbox server is now free, why not install the server? It will make this whole procedure obsolete.
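The difference between tagging records and selecting the tagged records can be pictured with a small model. This is a Python sketch only, not Cardbox code: it assumes, as described above, that an export writes the current selection and ignores tags, and the record numbers (the last 20 of 2,000) are illustrative.

```python
def export(selection):
    """Model of an export: it writes whatever is in the current selection."""
    return list(selection)


database = [{"ID": n} for n in range(1, 2001)]                  # Level 0: 2000 records
tagged = {rec["ID"] for rec in database if rec["ID"] > 1980}    # 20 records tagged

# Tagging alone does not change the selection, so everything is exported.
selection = database
exported_without_select = export(selection)

# The equivalent of Select > Tags narrows the selection to the tagged records.
selection = [rec for rec in database if rec["ID"] in tagged]
exported_after_select = export(selection)

print(len(exported_without_select), len(exported_after_select))
```

Reading the first export into a database that already holds most of those records is exactly how the duplicates would pile up.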

PeterG

20-Sep-2012 16:39

Thanks again for your advice, Charles. Yes, I forgot that step (and the maths was wrong too!) - age is catching up on me. I guess I'll have to think about the server now, especially as the database is as it should be.

By the way, I have the default setting for ordering each record (i.e. the most recent addition appears as the latest record). I also give each record a unique number which is indexed, and as you would expect the most recent record has the highest number (currently 4143). I cannot see from the manual how I might re-order some records so that they appear earlier than they currently do. That is, how would I move record 1300, for example, so that it is positioned within the data base as the 500th record, but retaining its unique number?

Many thanks again for your advice.

Peter.

bert

20-Sep-2012 21:19

Moving a record to another position can be done.
1 - select the record you want to move
2 - write that record to 1.dmp (internal backup file)
3 - delete that record
4 - then go back to selection level 0
5 - if you want that record at e.g. position 500, then select records 500: (all records above 500)
6 - write that selection to 2.dmp
7 - then delete those records
8 - then go back to selection Level 0
9 - read 1.dmp (import)
10 - read 2.dmp.
Your record is at the position you want....
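The ten steps above can be modelled on an ordinary list to see why they work. This is a Python sketch, not a Cardbox macro. It assumes that imported records are appended at the end of the database (which matches the default ordering Peter describes) and it reads "500:" as "position 500 and everything after it"; the function name `move_record` is made up for illustration.

```python
def move_record(database, from_pos, to_pos):
    """Move the record at 1-based from_pos so it ends up at 1-based to_pos."""
    db = list(database)                  # work on a copy, like keeping a backup
    dump_1 = [db.pop(from_pos - 1)]      # steps 1-3: write record to "1.dmp", delete it
    dump_2 = db[to_pos - 1:]             # steps 5-6: write records from to_pos on to "2.dmp"
    del db[to_pos - 1:]                  # step 7: delete those records
    db.extend(dump_1)                    # step 9: read "1.dmp" (appended at the end)
    db.extend(dump_2)                    # step 10: read "2.dmp"
    return db


# Ten records identified by their unique numbers; move number 8 to position 3.
database = list(range(1, 11))
print(move_record(database, 8, 3))
```

The moved record keeps its unique number because nothing in the record itself changes, only its position in the file.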

Of course, you do not want to do this more than once.
If you want to do it often, try to compose a macro for this. I did this years and years ago.
First record a macro doing this, then try to improve/change it. You will find it is not as difficult as it seems.

Regards
Bert

PeterG

21-Sep-2012 15:28

Thank-you, Bert. Yes, I don't want to do this too often, but what you suggest worked perfectly.

Many thanks again. Have a good weekend.

Peter.

© 2010 Cardbox Software Limited