Discussion:
anjuta search (was Re: anjuta docman plugin updates)
Johannes Schmid
2007-12-24 23:47:13 UTC
Permalink
Hi!
Best wishes for the festive season, if you have one of those now ...
Thanks - same for you!
* Added command and keybinding (F3, stolen from build plugin because all the editors I sampled use F3 for this) to repeat the prior search-box operation (hence some more docman changes)
* When <ctrl>f is repeated with a search-box already open, any currently selected text is passed to the pattern entry
* Added command and keybinding (<shift>F3) to repeat the prior "main" search, if any
* Reinstated anj 1.2.x command and context-menu item to "list all matches" (most of that code was still there)
* Conformed data pointers used in the plugin and in search prefs. I suspect the search prefs were not doing anything
Might be true. Maybe we should move the preferences to a page in the
Anjuta preferences to clean up the dialog.
* Concluded that since "non-editor-native" searching is needed for regex and not-opened files, that mode might as well be retained for all search/replace
Yes, of course. Though for the quick search using the native editor
interfaces seems better to me.
* Use slices for allocated memory
* Convert newly-opened file buffers to UTF-8 if needed, before use, because search-patterns from UI and all text in opened buffers (so Naba said) are already in that encoding
* Added funcs to map reasonably quickly both ways between byte-position and char-position
Interesting. I tried to do something like that but ended up being much
too slow. What did you use?
* Added more checks about which search operations can be performed
* Some memory cleans, including at plugin deactivation
* Fixed several minor bugs
* Lots of style conforms
* Deployed line_t and position_t which are both gint for now at least
* Quite a bit of API documentation and other explanation
Next
* implement Knuth-Morris-Pratt algorithm for forward searching at least (backward too if possible)
Wow, that would be nice of course. Since I am (becoming) an engineer
rather than a computer scientist and can't say much about it but from
what I read it would be a lot faster.
* implement utf8 form of regex searching
GRegex does this though it's only in the newest GLib. We probably need
some #ifdef'ing but maybe we can disable regex searching completely for
older Glib versions. People would need to update if they want it.
* fix function/block/selection searching (maybe I broke those, but anyhow, they don't work)
I guess they did not work before. The problem is that we need to detect
a block and this depends on the language. We could of course limit it to
some languages (C/C++/Java, maybe some else) for now.
* decide which of the 2 provided versions for both find-next and find-previous are more user-friendly
Which two version do you mean?
* improve some UI logic in the search dialog
* implement some way to turn off highlighting other than a repeated search with 0 matches (might be just any edit of the buffer)
* testing and bugs .... off-by-one etc etc
BTW, in playing with un-opened files, I noticed that ianjuta_docman_add_buffer() ignores content, I fixed that and anjuta_docman_add_editor() to make paths into uris so that the notebook tabs work properly for files opened during a search. Except that such files are marked as dirty even though they are just bookmarked, say.
You can use (and you might know that) ianjuta_file_set_dirty (FALSE) to
fix it. It's possible that they are dirty because of the loading. But if
you are unsure, I will have a look at it.

Well, you are really doing an amazing work! Thanks a lot!

Johannes
Johannes Schmid
2007-12-25 10:51:20 UTC
Permalink
Hi Tom!

(Plese CC the list on this issue, I think it's interesting for anyone)
Post by Johannes Schmid
* Conformed data pointers used in the plugin and in search prefs. I suspect the search prefs were not doing anything
Might be true. Maybe we should move the preferences to a page in the
Anjuta preferences to clean up the dialog.
I also thought that, but playing with glade is beyond my current expertise.
Well, I might have a look at that later.
Post by Johannes Schmid
* Added funcs to map reasonably quickly both ways between byte-position and char-position
Interesting. I tried to do something like that but ended up being much
too slow. What did you use?
Cached char-index as well as byte-index for each EOL. Each conversion then needs only to find the nearest reference point and g_utf8... from that point. But I've not evaluated how fast it is for heavy usage. Nor have I yet decided how best to update such cached values when replacing.
The caching could really make the difference in this case.
Post by Johannes Schmid
* implement utf8 form of regex searching
GRegex does this though it's only in the newest GLib. We probably need
some #ifdef'ing but maybe we can disable regex searching completely for
older Glib versions. People would need to update if they want it.
I saw stuff about GRegex but have never yet investigated it. Presumably good to use if available. Otherwise, probably ok to just upgrade PCRE_REQUIRED to 5.0 or so, and set PCRE_UTF8 and (because we will have already validated) PCRE_NO_UTF8_CHECK flags when compiling a regex. The availability of utf-8 support can be discovered at runtime. If it's not actually available, I guess we use the process anyway (as happens now), and hope that a search doesn't involve any fancy treatment of multi-byte chars.
GRegex is a very thin wrapper around pcre (and uses it internally). I
just makes everything a bit more GObject-like but anything else is more
or less the same.
Post by Johannes Schmid
* decide which of the 2 provided versions for both find-next and find-previous are more user-friendly
Which two version do you mean?
There's a pair that looks for the pattern from the most-recent search dialog, and another pair which IIRC looks for current-word or something. I think the latter pair are coded but not included in the UI, probably they're the old incremental search functions. As the seach-box process is more like the latter pair (and now more so, with its easy-repeat capability), I guess I should just disable the code for the second pair.
Can you remember who's working on the build plugin, and tell/ask him/her to substitute F2 (I think that's still available) for the current F3 binding there ?
Hmm, in the moment anjuta uses the same keybindings as gedit by default.
I think we should stick to the GNOME convention here as the shortcuts
can be easily modified in the preferences.
I'm away on holiday for a week from tomorrow, so, signing off for now.
Happy holidays!

Johannes
t***@onepost.net
2008-01-05 11:23:47 UTC
Permalink
Hello all

Back onto this, and I have a progress report.

I've done the foreshadowed updates, but a couple of issues have arisen, and I've not yet done much testing. In due course, I'll probably invite testers to play with the new code, before submitting for review and hopefully svn.

Pcre with utf-8 was almost trivial.

I coded forward/backward KMP searching, but found it didn't do much. Turns out I didn't do enough homework. KMP algorithm is best suited to data with lots of repetition. Probably Boyer-Moore would be a better way to go.

I think the misaligned-selection problem is fixed. Scintilla expects and reports byte-positions for selection operations, even with utf-8 encoded buffers. But it uses char-positions for setting positions as used in anjuta's iterable-related functions. Since I need to do char<>byte transforms anyway, the least-expensive and fastest way to meet scintilla as well as sourceview needs was to add an editor-interface function to get whether the editor uses byte- or char-positions, and manage the search data accordingly. Probably should add a corresponding function to the iterable interface. While sorting this stuff out I cleaned up and quickened some of the scintilla iterable-functions.

I can't see any point in the "basic search" option which is selectable from the settings page in a search dialog. Essentially all it does is hide the scope-related widgets. Anyone want to keep this capability ?

During testing sofar I got extremely sick of too many clicks or presses to define and initiate each search operation, so I decided to clean up a bit by rationalising the dialog layout. In particular, I put all search parameters onto a single page of the notebook. There's more scope for smarts in the logic governing the relations between parameters.

In relation to key-bindings, there are already differences between anj and gedit:

open search box - <C>k vs <C>f
open search dialog - <C>f vs <C><S>f

and some similarities

find next match in current file <C>g (etc?)

But gedit's searching has no need for a "repeat search-box operation" nor "repeat search-dialog operation" shortcut.

So what do people want me to apply ?

Regards
Tom
Post by Johannes Schmid
Post by Johannes Schmid
* Conformed data pointers used in the plugin and in search prefs. I suspect the search prefs were not doing anything
Might be true. Maybe we should move the preferences to a page in the
Anjuta preferences to clean up the dialog.
I also thought that, but playing with glade is beyond my current expertise.
Well, I might have a look at that later.
Post by Johannes Schmid
* Added funcs to map reasonably quickly both ways between byte-position and char-position
Interesting. I tried to do something like that but ended up being much
too slow. What did you use?
Cached char-index as well as byte-index for each EOL. Each conversion then needs only to find the nearest reference point and g_utf8... from that point. But I've not evaluated how fast it is for heavy usage. Nor have I yet decided how best to update such cached values when replacing.
The caching could really make the difference in this case.
Post by Johannes Schmid
* implement utf8 form of regex searching
GRegex does this though it's only in the newest GLib. We probably need
some #ifdef'ing but maybe we can disable regex searching completely for
older Glib versions. People would need to update if they want it.
I saw stuff about GRegex but have never yet investigated it. Presumably good to use if available. Otherwise, probably ok to just upgrade PCRE_REQUIRED to 5.0 or so, and set PCRE_UTF8 and (because we will have already validated) PCRE_NO_UTF8_CHECK flags when compiling a regex. The availability of utf-8 support can be discovered at runtime. If it's not actually available, I guess we use the process anyway (as happens now), and hope that a search doesn't involve any fancy treatment of multi-byte chars.
GRegex is a very thin wrapper around pcre (and uses it internally). I
just makes everything a bit more GObject-like but anything else is more
or less the same.
Post by Johannes Schmid
* decide which of the 2 provided versions for both find-next and find-previous are more user-friendly
Which two version do you mean?
There's a pair that looks for the pattern from the most-recent search dialog, and another pair which IIRC looks for current-word or something. I think the latter pair are coded but not included in the UI, probably they're the old incremental search functions. As the seach-box process is more like the latter pair (and now more so, with its easy-repeat capability), I guess I should just disable the code for the second pair.
Can you remember who's working on the build plugin, and tell/ask him/her to substitute F2 (I think that's still available) for the current F3 binding there ?
Hmm, in the moment anjuta uses the same keybindings as gedit by default.
I think we should stick to the GNOME convention here as the shortcuts
can be easily modified in the preferences.
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Johannes Schmid
2008-01-05 11:38:44 UTC
Permalink
Hi Tom!
Post by t***@onepost.net
I think the misaligned-selection problem is fixed. Scintilla expects and reports byte-positions for selection operations, even with utf-8 encoded buffers. But it uses char-positions for setting positions as used in anjuta's iterable-related functions. Since I need to do char<>byte transforms anyway, the least-expensive and fastest way to meet scintilla as well as sourceview needs was to add an editor-interface function to get whether the editor uses byte- or char-positions, and manage the search data accordingly. Probably should add a corresponding function to the iterable interface. While sorting this stuff out I cleaned up and quickened some of the scintilla iterable-functions.
Nice! Well, in my local trunk I completely removed the methods relying
or offsets from ianjuta_editor_selection and replace them by those
taking iterators. Maybe, and if others agree, wait until I can commit
this (it depends on the gtksourceview-patch applied but this will be
done today, hopefully).
Post by t***@onepost.net
I can't see any point in the "basic search" option which is selectable from the settings page in a search dialog. Essentially all it does is hide the scope-related widgets. Anyone want to keep this capability ?
I personally never used it and I guess we can simply remove it.
Post by t***@onepost.net
During testing sofar I got extremely sick of too many clicks or presses to define and initiate each search operation, so I decided to clean up a bit by rationalising the dialog layout. In particular, I put all search parameters onto a single page of the notebook. There's more scope for smarts in the logic governing the relations between parameters.
Very nice, there is already a bug about cleaning the interface so I am
looking forward to what you changed.

Not sure about the keybindings...

Just out of interest: Do you have an svn account? Because then we can
simply review and ask you to apply the patch yourself, which saves some
work. But please always send a patch to someone for review before you
apply anything. (In the future we might agree to give you that
permission, too, of course).

Thanks and regards,
Johannes
Post by t***@onepost.net
Post by Johannes Schmid
Post by Johannes Schmid
* Conformed data pointers used in the plugin and in search prefs. I suspect the search prefs were not doing anything
Might be true. Maybe we should move the preferences to a page in the
Anjuta preferences to clean up the dialog.
I also thought that, but playing with glade is beyond my current expertise.
Well, I might have a look at that later.
Post by Johannes Schmid
* Added funcs to map reasonably quickly both ways between byte-position and char-position
Interesting. I tried to do something like that but ended up being much
too slow. What did you use?
Cached char-index as well as byte-index for each EOL. Each conversion then needs only to find the nearest reference point and g_utf8... from that point. But I've not evaluated how fast it is for heavy usage. Nor have I yet decided how best to update such cached values when replacing.
The caching could really make the difference in this case.
Post by Johannes Schmid
* implement utf8 form of regex searching
GRegex does this though it's only in the newest GLib. We probably need
some #ifdef'ing but maybe we can disable regex searching completely for
older Glib versions. People would need to update if they want it.
I saw stuff about GRegex but have never yet investigated it. Presumably good to use if available. Otherwise, probably ok to just upgrade PCRE_REQUIRED to 5.0 or so, and set PCRE_UTF8 and (because we will have already validated) PCRE_NO_UTF8_CHECK flags when compiling a regex. The availability of utf-8 support can be discovered at runtime. If it's not actually available, I guess we use the process anyway (as happens now), and hope that a search doesn't involve any fancy treatment of multi-byte chars.
GRegex is a very thin wrapper around pcre (and uses it internally). I
just makes everything a bit more GObject-like but anything else is more
or less the same.
Post by Johannes Schmid
* decide which of the 2 provided versions for both find-next and find-previous are more user-friendly
Which two version do you mean?
There's a pair that looks for the pattern from the most-recent search dialog, and another pair which IIRC looks for current-word or something. I think the latter pair are coded but not included in the UI, probably they're the old incremental search functions. As the seach-box process is more like the latter pair (and now more so, with its easy-repeat capability), I guess I should just disable the code for the second pair.
Can you remember who's working on the build plugin, and tell/ask him/her to substitute F2 (I think that's still available) for the current F3 binding there ?
Hmm, in the moment anjuta uses the same keybindings as gedit by default.
I think we should stick to the GNOME convention here as the shortcuts
can be easily modified in the preferences.
Naba Kumar
2008-01-05 12:58:51 UTC
Permalink
Hi Jhs,
Post by Johannes Schmid
Nice! Well, in my local trunk I completely removed the methods relying
or offsets from ianjuta_editor_selection and replace them by those
taking iterators. Maybe, and if others agree, wait until I can commit
this (it depends on the gtksourceview-patch applied but this will be
done today, hopefully).
Does that mean you have fixed IAnjutaEditor to get rid of 'int
position'? Very nice. Looking forward to it :)
Post by Johannes Schmid
Post by t***@onepost.net
During testing sofar I got extremely sick of too many clicks or presses to define and initiate each search operation, so I decided to clean up a bit by rationalising the dialog layout. In particular, I put all search parameters onto a single page of the notebook. There's more scope for smarts in the logic governing the relations between parameters.
Very nice, there is already a bug about cleaning the interface so I am
looking forward to what you changed.
I agree. The interface is very tedious.
Post by Johannes Schmid
Not sure about the keybindings...
Originally, Anjuta had ctrl+f for search dialog and ctrl+e for
incremental search. IIRC, it got changed during the reimplementation of
i-search. Is ctrl+e being used somewhere else? Perhaps we can use ctrl+k
like gedit does?

Thanks.

Regards,
-Naba



-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Johannes Schmid
2008-01-05 13:22:07 UTC
Permalink
Hi Naba!
Post by Naba Kumar
Does that mean you have fixed IAnjutaEditor to get rid of 'int
position'? Very nice. Looking forward to it :)
It's now in trunk, please tell if I broke some things, it was a quite
huge commit.
I didn't get rid of postions completely, I thought it would be better to
do this incrementally. Positions are removed from IAnjutaEditorSelection
and IAnjutaEditorHover for now. I fixed all plugins relying on the
interface but those still use positions internally (debugger hover and
esspecially search). This has yet to be fixed but I am glad the Tom is
taking care of the search issue.

@Seb: I only change the hover code to compile, please fix it once you
find the time.

Regards,
Johannes
Sébastien Granjoux
2008-01-05 18:59:27 UTC
Permalink
Hi,
Post by Johannes Schmid
@Seb: I only change the hover code to compile, please fix it once you
find the time.
Ok, I have changed it to use the new interface. It seems to work fine
but I have fixed 2 bugs in the editor plugin.

1. There were some critical warning due to an invalid position when
emitting the hover signals. I don't know if it's really useful to emit
these signals when the position is invalid. The debugger code doesn't
care and just do nothing. But I think it will be better to not emit
anything if everybody agree.

2. I have fixed the get_text_iter function, to get all characters
between begin and end, including the characters at begin and at end. I
think it's the right behavior according to the documentation no ?
The corresponding message in Scintilla doesn't include the last
character, so I have to do a bit more work.

I haven't test this with gtksourceview because I don't have the latest
version installed so the plugin doesn't compile anymore. Could you check
that it's working ?

Regards,

Sébastien

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Johannes Schmid
2008-01-06 00:15:29 UTC
Permalink
Hi Sébastien!
Post by Sébastien Granjoux
1. There were some critical warning due to an invalid position when
emitting the hover signals. I don't know if it's really useful to emit
these signals when the position is invalid. The debugger code doesn't
care and just do nothing. But I think it will be better to not emit
anything if everybody agree.
I would agree that we shouldn't emit the signal on an invalid position.
But I have to check if this case could happen in GtkSourceView.
Post by Sébastien Granjoux
2. I have fixed the get_text_iter function, to get all characters
between begin and end, including the characters at begin and at end. I
think it's the right behavior according to the documentation no ?
The corresponding message in Scintilla doesn't include the last
character, so I have to do a bit more work.
Sorry that I didn' test that but the interface was no yet used and.
Thanks for fixing.
Post by Sébastien Granjoux
I haven't test this with gtksourceview because I don't have the latest
version installed so the plugin doesn't compile anymore. Could you check
that it's working ?
I will check it. If you want to have a look at the new plugin you need
gtksourceview from svn trunk. It has no other dependencies except GTK+
2.8 and libxml 2.5 which should be already installed.

Thanks and regards,
Johannes
Naba Kumar
2008-01-06 10:54:39 UTC
Permalink
Post by Johannes Schmid
Hi Naba!
Post by Naba Kumar
Does that mean you have fixed IAnjutaEditor to get rid of 'int
position'? Very nice. Looking forward to it :)
It's now in trunk, please tell if I broke some things, it was a quite
huge commit.
I made a pass through the changes and found some missing unrefs for the
iters. I committed the fixes in svn. The attached patch shows where the
missing unrefs have been added.

Thanks.

Regards,
-Naba
t***@onepost.net
2008-01-06 03:45:31 UTC
Permalink
On Sat, 05 Jan 2008 12:38:44 +0100
Johannes Schmid <***@jsschmid.de> wrote:

Johannes I have a bad feeling about one of your proposed changes ...
Post by Johannes Schmid
Post by t***@onepost.net
I think the misaligned-selection problem is fixed. Scintilla expects and reports byte-positions for selection operations, even with utf-8 encoded buffers. But it uses char-positions for setting positions as used in anjuta's iterable-related functions. Since I need to do char<>byte transforms anyway, the least-expensive and fastest way to meet scintilla as well as sourceview needs was to add an editor-interface function to get whether the editor uses byte- or char-positions, and manage the search data accordingly. Probably should add a corresponding function to the iterable interface. While sorting this stuff out I cleaned up and quickened some of the scintilla iterable-functions.
Nice! Well, in my local trunk I completely removed the methods relying
or offsets from ianjuta_editor_selection and replace them by those
taking iterators. Maybe, and if others agree, wait until I can commit
this (it depends on the gtksourceview-patch applied but this will be
done today, hopefully).
I expect this change would be quite negative for scintilla searching, as it would add un-necessary conversions back and forth between char-position and byte-position. In particular, for any in-search-plugin process that uses byte-positions (i.e Boyer-Moore etc), we end up converting from byte- to char- and back to byte- just to set a position for selection or replacement.

Also, we'd need a (quick) process for the scintilla-editor to do such conversion - at least from the iterator's char-position to the corresponding byte-position needed by the SCI_SETSEL function. And to be entirely consistent, we'd also want to convert SCI_GETSELECTIONSTART etc return-values to char-positions before handing them back to the caller. Playing with g_utf8_*() and strlen() and a grabbed text copy (for both start and end iters) is terribly inefficient.

My recommendation is that you leave the offset-related functions in place after adding iter-related functions.
Post by Johannes Schmid
Post by t***@onepost.net
I can't see any point in the "basic search" option which is selectable from the settings page in a search dialog. Essentially all it does is hide the scope-related widgets. Anyone want to keep this capability ?
I personally never used it and I guess we can simply remove it.
Post by t***@onepost.net
During testing sofar I got extremely sick of too many clicks or presses to define and initiate each search operation, so I decided to clean up a bit by rationalising the dialog layout. In particular, I put all search parameters onto a single page of the notebook. There's more scope for smarts in the logic governing the relations between parameters.
Very nice, there is already a bug about cleaning the interface so I am
looking forward to what you changed.
Not sure about the keybindings...
Regards
Tom

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Naba Kumar
2008-01-06 09:30:46 UTC
Permalink
Hi Tom,
Post by t***@onepost.net
I expect this change would be quite negative for scintilla searching,
as it would add un-necessary conversions back and forth between
char-position and byte-position. In particular, for any
in-search-plugin process that uses byte-positions (i.e Boyer-Moore
etc), we end up converting from byte- to char- and back to byte- just
to set a position for selection or replacement.
byte to char position conversion in scintilla is not a problem in, as
long as it's not done in tremendous amount of times. char to byte
conversion is however very inefficient, as you pointed out.

But, you shouldn't need so much char to byte conversions.
Post by t***@onepost.net
Also, we'd need a (quick) process for the scintilla-editor to do such
conversion - at least from the iterator's char-position to the
corresponding byte-position needed by the SCI_SETSEL function.
The scintilla implementation of the iters uses byte position inside, so
anything inside scintilla is done all with byte position from the iter's
(priv) data. And everything outside uses the iters.
Post by t***@onepost.net
And to be entirely consistent, we'd also want to convert
SCI_GETSELECTIONSTART etc return-values to char-positions before
handing them back to the caller.
Not really. The caller only gets the iters and it (the external caller)
should try to use the iters as much as possible. Converting to char
position isn't a big at that point, however it should still try to avoid
any hard numbers for char positions.

What's most important is, converting the iters to byte-positions should
be avoided at all costs outside the editor plugins.

Are you saying that you require to use byte-positions outside the editor
plugins no mater what?

Thanks.

Regards,
-Naba



-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
t***@onepost.net
2008-01-06 22:48:37 UTC
Permalink
On Sun, 06 Jan 2008 11:30:46 +0200
Naba Kumar <***@gmail.com> wrote:

Hello all
Post by Naba Kumar
Post by t***@onepost.net
I expect this change would be quite negative for scintilla searching,
as it would add un-necessary conversions back and forth between
char-position and byte-position. In particular, for any
in-search-plugin process that uses byte-positions (i.e Boyer-Moore
etc), we end up converting from byte- to char- and back to byte- just
to set a position for selection or replacement.
byte to char position conversion in scintilla is not a problem in, as
long as it's not done in tremendous amount of times. char to byte
conversion is however very inefficient, as you pointed out.
But, you shouldn't need so much char to byte conversions.
Post by t***@onepost.net
Also, we'd need a (quick) process for the scintilla-editor to do such
conversion - at least from the iterator's char-position to the
corresponding byte-position needed by the SCI_SETSEL function.
The scintilla implementation of the iters uses byte position inside, so
anything inside scintilla is done all with byte position from the iter's
(priv) data. And everything outside uses the iters.
Post by t***@onepost.net
And to be entirely consistent, we'd also want to convert
SCI_GETSELECTIONSTART etc return-values to char-positions before
handing them back to the caller.
Not really. The caller only gets the iters and it (the external caller)
should try to use the iters as much as possible. Converting to char
position isn't a big at that point, however it should still try to avoid
any hard numbers for char positions.
What's most important is, converting the iters to byte-positions should
be avoided at all costs outside the editor plugins.
Are you saying that you require to use byte-positions outside the editor
plugins no mater what?
Yes.

Perhaps a sort of diagram would clarify things. I'll use these symbols:
B byte-position
C char-position
Post by Naba Kumar
conversion
U upstream, in editor plugin
D downstream, in search plugin
TE scintilla editor
SV sourceview editor

search-position/range info provided by editor-native functionality

TE = B
SV = C

select/replace info wanted by editor (same as above, of course)

search process may use B or C ("advanced", pcre with/without utf8, editor-native B or C, etc?)

So for the old search process we get

range search operation

TE == B == TE
TE D:B>C C D:C>B TE
SV D:C>B B D:B>C SV
SV == C == SV

Last time I checked, scintilla iterable positioning (as distinct from getting selection etc positions) was based on SCI_POSITIONBEFORE or SCI_POSITIONAFTER, and AFAIK these are the only scintilla position-functions that use C, not B. To avoid confusion, I guess you'd want iterable positions to always be of one type, not context-specific B or C.

After the iter changes, and assuming that iterable positions are always C

range search operation

TE U:B>C,D:C>B B D:B>C,U:C>B TE
TE U:B>C C U:C>B TE
SV D:C>B B D:B>C SV
SV == C == SV

So this needs conversion code & processes upstream as well as downstream, searching/replacing is often slower for scintilla, but there's no saving for any other combination.

Or if scintilla iterable positioning is always to be B, there's still replicated conversion code for the search process, and also more process and possibly more code in other context(s).

Hence my previous recommendation for retaining the current capabilities.

C>B as was (is?) performed in editor plugins is _obviously_ inefficient. Without checking how scintilla performs B>C, I'd guess that it's similarly bad, because there would not have been much expected need for conversion, and so not much attention to its performance. My downstream conversions use cached line-data to speed conversions, but it's still not great for long files and long lines. The lesson for the day is ... minimise conversions !

Regards
Tom

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2005.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Naba Kumar
2008-01-09 09:05:23 UTC
Permalink
Hi Tom,
range search operation
TE U:B>C,D:C>B B D:B>C,U:C>B TE
TE U:B>C C U:C>B TE
SV D:C>B B D:B>C SV
SV == C == SV
From this, I understand you intend to do the search in integer index
(either B or C). That is sadly not going to be efficient at all. Either
scintilla or sourceview is going to hit the conversion bottleneck, as
you pointed out above.

If you can implement the search algorithm to use the iters directly as
indexes, that would prevent any conversions around. Although there is
some performance overhead in iter navigation (i.e. it's not as straight
forward as integer +/-), thankfully it will be linear (for both editors)
and I believe it can handle 10K lines of file easily without annoyance
(yet to be seen, though). auto-indent code already does it that way,
sometimes with multiple passes.

If you use the algorithm in B or C integer indexes and use iters only
for intermidiate fetching mechanism, you would end up with conversions
in the middle of the algorithm and that would be very expensive. This is
exactly what I suggested to avoid.

Now, I don't really now how the algorithm works, but can you confirm if
it is doable using iters instead of any integer indexes? In theory, it
should be possible although it would a bit more involved.

In short, the most expensive API call is ianjuta_iterable_set(int
position) /* the position is in C index */ _only_ in scintilla. Or
equivalent ianjuta_iterable_get_text_iter(int position). As long as you
avoid calling them, everything should be fine. At most try not to call
it more than once (e.g. only at the end or beginning of the search).

Here is the a rough cost association of each conversion.

Another rather not-so-good approach, especially if you can't avoid using
integer index, would be to cache the whole editor text in your plugin's
buffer and perform the search there. That way, you can efficiently use
integer indexes.

Please note that eventually IAnjutaEditor interface (and its
derivatives) will not have have any int index API at all, either C or B,
so everything has to adapt to work with iters only. Only a couple of
methods will be there to set/get the iters with C indexes, mainly for
initialization in odd situations where indexes arrive from external
sources.

Thanks.

Regards,
-Naba
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Johannes Schmid
2008-01-09 17:36:16 UTC
Permalink
Hi all!

I think using iterators for the search algorithm is not a good idea
because editors can search through ianjuta_editor_search_* interface
much more efficient.

Problem is we have three cases:
- File is open in editor: We can use the search interface
- File is not open in editor(regex/non-regex): We need to search in the
file. Here an algorithm working on bytes should be efficient and as we
only show the line when the user clicks the result it won't cause any
wrong aligned selection and no conversion is necessary. We can use
pcre/GRegex for regex searches here.
- Regex search (File is open in editor):
- Scintilla:
Can do this natively, we could extend ianjuta_editor_search()
- GtkSourceView:
Can do this natively yet though it's planned. Here, we have a
problem.

Of course it's not very nice to have multiple algorithms for mostly the
same operation but I guess that's nearly unavoidable. But until now, I
have no solution for regex search in GtkSourceView.

Regards,
Johannes
Post by Naba Kumar
Hi Tom,
range search operation
TE U:B>C,D:C>B B D:B>C,U:C>B TE
TE U:B>C C U:C>B TE
SV D:C>B B D:B>C SV
SV == C == SV
From this, I understand you intend to do the search in integer index
(either B or C). That is sadly not going to be efficient at all. Either
scintilla or sourceview is going to hit the conversion bottleneck, as
you pointed out above.
If you can implement the search algorithm to use the iters directly as
indexes, that would prevent any conversions around. Although there is
some performance overhead in iter navigation (i.e. it's not as straight
forward as integer +/-), thankfully it will be linear (for both editors)
and I believe it can handle 10K lines of file easily without annoyance
(yet to be seen, though). auto-indent code already does it that way,
sometimes with multiple passes.
If you use the algorithm in B or C integer indexes and use iters only
for intermidiate fetching mechanism, you would end up with conversions
in the middle of the algorithm and that would be very expensive. This is
exactly what I suggested to avoid.
Now, I don't really now how the algorithm works, but can you confirm if
it is doable using iters instead of any integer indexes? In theory, it
should be possible although it would a bit more involved.
In short, the most expensive API call is ianjuta_iterable_set(int
position) /* the position is in C index */ _only_ in scintilla. Or
equivalent ianjuta_iterable_get_text_iter(int position). As long as you
avoid calling them, everything should be fine. At most try not to call
it more than once (e.g. only at the end or beginning of the search).
Here is the a rough cost association of each conversion.
Another rather not-so-good approach, especially if you can't avoid using
integer index, would be to cache the whole editor text in your plugin's
buffer and perform the search there. That way, you can efficiently use
integer indexes.
Please note that eventually IAnjutaEditor interface (and its
derivatives) will not have have any int index API at all, either C or B,
so everything has to adapt to work with iters only. Only a couple of
methods will be there to set/get the iters with C indexes, mainly for
initialization in odd situations where indexes arrive from external
sources.
Thanks.
Regards,
-Naba
t***@onepost.net
2008-01-10 23:49:03 UTC
Permalink
On Wed, 09 Jan 2008 11:05:23 +0200
Naba Kumar <***@gmail.com> wrote:

Naba, I think I've been too cryptic in my comments about using iters during search/replace operations.

Actually, I'm quite indifferent to using iters per se, as I expect that the additional overhead to create and destroy the objects, and to set and retrieve the position-information for the iters, will be insignificant in the overall operation.

Of course an iter is essentially a vehicle for conveying position information and it's that information, not the iter, that's ultimately needed for an operation. As previously mentioned, some local searching uses char-positions (C), some uses byte-positions (B) whether or not the text actually includes multi-byte character(s). If we were to use iters for local searching, it would simply shift the points of B<>C conversions, but not affect any of the issues I've raised.

Historically, and sofar I've not seen net benefit in changing this, a local copy of the searched text is always created during a search (one via a search-dialog or a list-all-matches operation, not via a search-box operation). For investigation purposes at least, I've recently converted from simple local "brute-force" searching to editor-native searching when reasonable (no regex etc etc). So I'm aleady using iters for that.

BTW setting search-range iter position in the code I have is _always_ C, while reported match position is B for scintilla or C for sourceview - what fun! Handling all this was the fix for the mis-alignment bug in bugzilla etc. Recent svn updates may have reconciled this difference ?

Without wanting to rant, if iters for scintilla always represent B, and iters for sourceview always represent C, then it seems to me that an iter is abstraction for its own sake, and why bother?. And for scintilla we can't use things like iter-next, because you may go to the middle of a character.

OTOH, if iters always represent a C, then with scintilla some serious B<>C conversions are needed, as most SCI_* position-func's use B, and some s/r likewise uses B. I've not investigated the usage of iters (or offsets) elsewhere in anjuta.

So in summary, for s/r at least there's no benefit, but no great cost, in using iters to set/get relevant positions _if_ the iters always reflect editor-native positioning. If iters are always C, then Houson, we have a problem ...

Again I recommend:
1. get/set position-information with least-possible B<>C conversion, which requires the capability to communicate using editor-native positioning
2. for s/r usage at least, keep the API which communicates positions directly, because using iters gains nothing
Also:
3. if scintilla iters are to be always B, and not already done, then the relevant set-position function needs to omit SCI_POSITIONBEFORE and/or SCI_POSITIONAFTER.

Regards
Tom
Post by Naba Kumar
range search operation
TE U:B>C,D:C>B B D:B>C,U:C>B TE
TE U:B>C C U:C>B TE
SV D:C>B B D:B>C SV
SV == C == SV
From this, I understand you intend to do the search in integer index
(either B or C). That is sadly not going to be efficient at all. Either
scintilla or sourceview is going to hit the conversion bottleneck, as
you pointed out above.
If you can implement the search algorithm to use the iters directly as
indexes, that would prevent any conversions around. Although there is
some performance overhead in iter navigation (i.e. it's not as straight
forward as integer +/-), thankfully it will be linear (for both editors)
and I believe it can handle 10K lines of file easily without annoyance
(yet to be seen, though). auto-indent code already does it that way,
sometimes with multiple passes.
If you use the algorithm in B or C integer indexes and use iters only
for intermidiate fetching mechanism, you would end up with conversions
in the middle of the algorithm and that would be very expensive. This is
exactly what I suggested to avoid.
Now, I don't really now how the algorithm works, but can you confirm if
it is doable using iters instead of any integer indexes? In theory, it
should be possible although it would a bit more involved.
In short, the most expensive API call is ianjuta_iterable_set(int
position) /* the position is in C index */ _only_ in scintilla. Or
equivalent ianjuta_iterable_get_text_iter(int position). As long as you
avoid calling them, everything should be fine. At most try not to call
it more than once (e.g. only at the end or beginning of the search).
Here is the a rough cost association of each conversion.
Another rather not-so-good approach, especially if you can't avoid using
integer index, would be to cache the whole editor text in your plugin's
buffer and perform the search there. That way, you can efficiently use
integer indexes.
Please note that eventually IAnjutaEditor interface (and its
derivatives) will not have have any int index API at all, either C or B,
so everything has to adapt to work with iters only. Only a couple of
methods will be there to set/get the iters with C indexes, mainly for
initialization in odd situations where indexes arrive from external
sources.
Thanks.
Regards,
-Naba
-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Johannes Schmid
2008-01-11 08:36:33 UTC
Permalink
Hi!
Post by t***@onepost.net
Without wanting to rant, if iters for scintilla always represent B, and iters for sourceview always represent C, then it seems to me that an iter is abstraction for its own sake, and why bother?. And for scintilla we can't use things like iter-next, because you may go to the middle of a character.
OTOH, if iters always represent a C, then with scintilla some serious B<>C conversions are needed, as most SCI_* position-func's use B, and some s/r likewise uses B. I've not investigated the usage of iters (or offsets) elsewhere in anjuta.
All iterators (scintilla and sourceview) represent C, otherwise they
would be useless. next() means to go forward a whole utf8-character.

Scintilla docs:
SCI_POSITIONBEFORE(int position)
SCI_POSITIONAFTER(int position)
These messages return the position before and after another position in
the document taking into account the current code page. The minimum
position returned is 0 and the maximum is the last position in the
document. If called with a position within a multi byte character will
return the position of the start/end of that character.

Regards,
Johannes
Naba Kumar
2008-01-12 00:21:09 UTC
Permalink
Hi Tom,
Post by t***@onepost.net
Without wanting to rant, if iters for scintilla always represent B,
and iters for sourceview always represent C, then it seems to me that
an iter is abstraction for its own sake, and why bother?. And for
scintilla we can't use things like iter-next, because you may go to
the middle of a character.
That's not the case. The point of iter abstraction is that everyone
outside sees the same -- irrespective of what editor is behind. It is
basically, a C based navigation. Next/prev does C +1/-1 for any editor.
We choose C because we expect everybody to use utf-8 throughout Anjuta
(and everybody does apparently).

This has the interesting side effect that some implementation has weird
way of fulling this API's contract. For example, the horrid performance
in ianjuta_iter_set_position (int position) in case of scintilla... In
addition to some overhead of using iters.

Even in case of sourceview, we don't really know what's inside, because
it has its own iter abstraction. Is it C, B or somehow both? Only their
developers know. For us, what matters is that we should successfully
abstract their abstraction.

Now, why did we decide to go with iter abstraction despite
uneventualities like these. The simple answer is: consistency. There
were some discussion in past about this and that's where we agreed to
solely go with iter interface. It all started because we had enough
troubles (and there still are more [1]) with the same 'int position'
somehow meaning two completely different things. The consistency was
needed and we were ready to pay the price.

If I understand correctly, your proposal is to keep the 'int position'
API and have some sort of flag to identify if it's C or B, depending on
the editor implementation.

While I believe that's one way to solve this issue, it is clearly not an
elegant way. We will end up with 3 different perspectives of the same
API and it puts the load of context-check/sanity-check in everybody's
plate trying to use it. There are even cases where you could have both
instances of the editors running (different IAnjutaEditor objects) [2].
Trying to cross-use some code between the two could be disastrous.

That said, we clearly can't get rid of every 'int position' API from the
interface. There would still be some get/set(int) APIs to initiate the
iters. In most real cases, iters should originate entirely from inside
of the editor plugins (signals emitted, current position, selection
start, selection end, beginning of file, end of file etc.), so life is
cool and no need for dreaded int->iter APIs. However, there would still
be some cases where we might need them, such as jumping to an error
location, setting breakpoint etc; conversion required there.
Post by t***@onepost.net
OTOH, if iters always represent a C, then with scintilla some serious
B<>C conversions are needed, as most SCI_* position-func's use B, and
some s/r likewise uses B. I've not investigated the usage of iters (or
offsets) elsewhere in anjuta.
Let me explain how it is currently implemented. There really is no
serious conversions like you are worried.

In scintilla, The iter class (called IAnjutaEditorCell) implements the
IAnjutaIterable interface (which has next/prev API, among others). The
class has private member 'int byte_position'. This is what is used
throughout inside the editor plugin. Nothing in that plugin uses C index
anywhere. Now, this 'byte_position' is *not* the 'int position' you see
from outside the interface. In fact, there is no C index stored anywhere
in the iter class. It's calculated on-the-fly on need basis (that's why
ianjuta_iterable_set/get is expensive for this editor).

When the iter class is instantiated, the byte_position of course has to
be initialized. Luckily, in most cases the iter is instantiated from
existing known positions (current position, selection start, selection
end etc.) so you know the byte_position value to initialize. In the
get/set() methods, you get to use the conversions. Bad, but we live. We
just try to avoid them as much as possible.

The iter operations then is just a matter of fast-forwarding/reversing
the internal byte_position in utf-8 steps. That stepping is done using
scintilla commands SCI_POSITIONBEFORE(int byte_position) and
SCI_POSITIONAFTER(int byte_position).

In case of sourceview, the internal representation is GtkTextIter (or
something similar) which is more or less similar to our IAnjutaIterable
interface. Johannes knows better.
Post by t***@onepost.net
So in summary, for s/r at least there's no benefit, but no great cost,
in using iters to set/get relevant positions _if_ the iters always
reflect editor-native positioning. If iters are always C, then Houson,
we have a problem ...
As I explained above, it is C from outside IAnjutaIterable interface and
whatever-the-most-efficient-representation (native representation) from
inside. Both worlds live happily :)
Post by t***@onepost.net
1. get/set position-information with least-possible B<>C conversion,
which requires the capability to communicate using editor-native
positioning
Unfortunately, they are going to remain inefficient because of inherent
issue with C<>B conversions and we cannot avoid conversions at all. We
are not the only one suffering with it (e.g. g_utf8_*). I am sure
GtkTextBuffer has similar problem where it 'recommends' using certain
APIs over others to avoid using the conversion-laden APIs.

At some point I even thought about fixing scintilla to maintain an index
(along side the text buffer) to make C<>B conversion easy, but again
thought, may be it's just a matter of using the current API smartly.
Then I didn't bother to think any further. If things turn ugly later, I
wouldn't mind revisiting the idea.
Post by t***@onepost.net
2. for s/r usage at least, keep the API which communicates positions
directly, because using iters gains nothing
Not recommended by me.
Post by t***@onepost.net
3. if scintilla iters are to be always B, and not already done, then
the relevant set-position function needs to omit SCI_POSITIONBEFORE
and/or SCI_POSITIONAFTER.
I already clarified that part.

About your search/replace implementation, perhaps Johannes's suggestion
to implement it in each editor, is the way to go? That way, you can take
full advantage of native code, and because of that I wouldn't really
call it duplicated effort. Even though the algorithms are same, they are
being implemented in the most efficient way. As a bonus, you don't need
to worry about conversions :).

Thanks.

Regards,
-Naba

[1] If you open file containing utf-8 chars and scroll somewhere around
the end of the file and use autoindent (ctrl-i), you will notice weird
results.

[2] If you first load a file with scintilla plugin active, go to prefs,
enable sourceview plugin and load another file, you will end up with two
editor objects: one from scintilla and other from sourceview.



-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
t***@onepost.net
2008-01-12 04:50:01 UTC
Permalink
On Sat, 12 Jan 2008 02:21:09 +0200
Naba Kumar <***@gmail.com> wrote:

Folks, this discussion has gone off the rails. From my perspective that's due to two things:

1. ianjuta_editor_search_forward|backward() use iters to set search range and receive match positions, and for scintilla those are all byte-positions because the SCI_FINDTEXT function is not codepage-specific.
Post by Naba Kumar
The scintilla implementation of the iters uses byte position inside, so
anything inside scintilla is done all with byte position from the iter's
(priv) data. And everything outside uses the iters.
But now we're back on track about iters representing char-position, the way forward is clear and simple:

1. ianjuta_editor_search_forward|backward() must not use iters, to avoid the conversion penalties discussed in previous posts.

2. because text-selection is such a fundamental part of the s/r process, and we want to minimise B<>C conversions there too, many of the ianjuta_editor_selection_*() funcs need to not use iters, or else, there need to be supplementary funcs that do not use iters.

Easy, really !

See also my further comments mingled with the parent message, below.

Regards
Tom
Post by Naba Kumar
This has the interesting side effect that some implementation has weird
way of fulling this API's contract. For example, the horrid performance
in ianjuta_iter_set_position (int position) in case of scintilla... In
addition to some overhead of using iters.
That approach (IIRC it was a horrid iter-next loop from start of file), is only relevant if you want to set an iter position using a byte-position. In practice, why would you want to do that? If the backend function that uses the iter information requires B, you sure don't want to convert B>C to store the iter position and then C>B again to do whatever is intended. If the backend function uses C, you would be in some context where C is reasonably available (iter_get_position, iter_clone, iter_next|previous, iter_first|last etc). Essentially the same point you present, below.

IMHO we should be consistent - iters represent C, so set them as such.
Post by Naba Kumar
If I understand correctly, your proposal is to keep the 'int position'
API and have some sort of flag to identify if it's C or B, depending on
the editor implementation.
Most definitely yes! I've already implemented this, here. Very small and simple change.
Post by Naba Kumar
While I believe that's one way to solve this issue, it is clearly not an
elegant way. We will end up with 3 different perspectives
Well, 2 in the case of selection, but we can entirely forget iters for searching so just 1 "perspective" there, and likewise for any other functionality (maybe some of the examples you list below) that benefits from a similar approach to s/r.

of the same
Post by Naba Kumar
API and it puts the load of context-check/sanity-check in everybody's
plate trying to use it.
Precisely so. That's exactly what _is_ needed in order to avoid the performance penalties. Just one function-call to decide ...

There are even cases where you could have both
Post by Naba Kumar
instances of the editors running (different IAnjutaEditor objects) [2].
Trying to cross-use some code between the two could be disastrous.
Nope. The code needs to check which type to use, and unless that type were cached (wrongly assuming that the editor can't change) then it will work properly every time.
Post by Naba Kumar
Let me explain how it is currently implemented. There really is no
serious conversions like you are worried.
In scintilla, The iter class (called IAnjutaEditorCell) implements the
IAnjutaIterable interface (which has next/prev API, among others). The
class has private member 'int byte_position'. This is what is used
throughout inside the editor plugin. Nothing in that plugin uses C index
anywhere. Now, this 'byte_position' is *not* the 'int position' you see
from outside the interface. In fact, there is no C index stored anywhere
in the iter class. It's calculated on-the-fly on need basis (that's why
ianjuta_iterable_set/get is expensive for this editor).
That's just the sort of thing I expected, since code-page support came along when scintilla was already mature. IMO such on-the-fly conversion _is_ very likely to be a performance issue which passes through to anj performance, unless such conversion is more efficient than I fear.
Post by Naba Kumar
In case of sourceview, the internal representation is GtkTextIter (or
something similar) which is more or less similar to our IAnjutaIterable
interface. Johannes knows better.
Yes, I'm well familiar with GtkTextIter and GtkTextBuffer. I've developed a small editor for another (gtk-only) application.
Post by Naba Kumar
Unfortunately, they are going to remain inefficient because of inherent
issue with C<>B conversions and we cannot avoid conversions at all. We
are not the only one suffering with it (e.g. g_utf8_*). I am sure
GtkTextBuffer has similar problem where it 'recommends' using certain
APIs over others to avoid using the conversion-laden APIs.
At some point I even thought about fixing scintilla to maintain an index
(along side the text buffer) to make C<>B conversion easy, but again
thought, may be it's just a matter of using the current API smartly.
Sorry, but no, that's not an efficient solution.
Post by Naba Kumar
About your search/replace implementation, perhaps Johannes's suggestion
to implement it in each editor, is the way to go?
Editor-native searching already exists, and it can meet some our our needs. As mentioned in a previous message, I'm using that now, where relevant. Porting the whole s/r process to each editor seems to me to be a vast waste.

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Johannes Schmid
2008-01-12 18:04:58 UTC
Permalink
Hi Tom!

First, thanks for all the work you have done on this search topic. But
we have to find a solution that also handles future use cases and that
requires some discussion.
Post by t***@onepost.net
1. ianjuta_editor_search_forward|backward() use iters to set search range and receive match positions, and for scintilla those are all byte-positions because the SCI_FINDTEXT function is not codepage-specific.
Post by Naba Kumar
The scintilla implementation of the iters uses byte position inside, so
anything inside scintilla is done all with byte position from the iter's
(priv) data. And everything outside uses the iters.
1. ianjuta_editor_search_forward|backward() must not use iters, to avoid the conversion penalties discussed in previous posts.
Where would such a conversion occur?
Post by t***@onepost.net
2. because text-selection is such a fundamental part of the s/r process, and we want to minimise B<>C conversions there too, many of the ianjuta_editor_selection_*() funcs need to not use iters, or else, there need to be supplementary funcs that do not use iters
Why should in any case, the SCI_FINDTEXT function ever return a position
that is not the begin/end of an utf-8 character? So there shouldn't be
any conversion necessary.

If ianjuta_editor_search() are incorrectly implemented in scintilla,
it's my fault, but it is still a bug in the editor and nothing that is
wrong conceptionally. As far as I checked the returned iterators are
compeletely ok.

Of course, the get_position() method of those iterators will return a
different value as get_position() of a gtksourceview iterator, but
get_position() is editor specific anyway. That's why we want to get rid
of position and just use the iterators.

Regards,
Johannes
t***@onepost.net
2008-01-13 11:49:00 UTC
Permalink
On Sat, 12 Jan 2008 19:04:58 +0100
Johannes Schmid <***@jsschmid.de> wrote:

Johannes, there still seems to be misunderstanding in this discussion.
Post by Johannes Schmid
First, thanks for all the work you have done on this search topic. But
we have to find a solution that also handles future use cases and that
requires some discussion.
Post by t***@onepost.net
1. ianjuta_editor_search_forward|backward() use iters to set search range and receive match positions, and for scintilla those are all byte-positions because the SCI_FINDTEXT function is not codepage-specific.
Post by Naba Kumar
The scintilla implementation of the iters uses byte position inside, so
anything inside scintilla is done all with byte position from the iter's
(priv) data. And everything outside uses the iters.
1. ianjuta_editor_search_forward|backward() must not use iters, to avoid the conversion penalties discussed in previous posts.
Where would such a conversion occur?
When I hunt through the scintilla API documentation, the only position-related functions that make reference to code-page, and therefore (in anjuta) work with char-positions instead of byte-positions, are SCI_POSITIONBEFORE and SCI_POSITIONAFTER. The scintilla-editor iter-related functions (iter-first|next etc) use those functions. All of the other scintilla-editor functions which get/set positions e.g. selection-related, find-related, have to give or accept byte-positions to/from scintilla. I can vouch for this in relation to finding and selection, on the basis of my many hours of stepping through and fixing the operation of s/r code.

So if you wish to use iters to interface with such functionality, and we're agreed that iters represent char-positions, then it's necessary for each such scintilla function to convert C>B inside the editor plugin.

If outside-editor process e.g. s/r operation, needs to work with byte-positions, the communication process becomes

operation <> iter <> scintilla
B <> C <> B

This double-trouble is the concern that I've been referring to for a while. But it's not just a s/r matter.
Post by Johannes Schmid
Post by t***@onepost.net
2. because text-selection is such a fundamental part of the s/r process, and we want to minimise B<>C conversions there too, many of the ianjuta_editor_selection_*() funcs need to not use iters, or else, there need to be supplementary funcs that do not use iters
Why should in any case, the SCI_FINDTEXT function ever return a position
that is not the begin/end of an utf-8 character? So there shouldn't be
any conversion necessary.
It will of course report the begin/end of whatever you ask it to look for. It will report in byte-positions. So presumably you could ask it to look for partial chars if you were so inclined.

BTW, pcre is the same. Even when supporting utf-8 chars, all of the positions it reports are byte-positions.
Post by Johannes Schmid
If ianjuta_editor_search() are incorrectly implemented in scintilla,
it's my fault,
Only insofar as you appear to have assumed that scintilla _always_ uses C for positions, and so you do not do the conversion from an iter's C to scintilla's B and vice versa. Actually this is a good outcome in many cases, or else you'd have all that wasted conversion activity. You get away with it when it's ok to ignore the nature of the iter data on both ends of the process, e.g. in the search-box where you find using an iter and then use that same data to select. In both cases the data transfers are actually B. I've recently been using this for search-box searches too. I've looked into it very closely.

but it is still a bug in the editor and nothing that is
Post by Johannes Schmid
wrong conceptionally.
Well, if the concept is that we should use C-related iter information for all interfacing and at any cost, then I'd say the concept is wrong.

As far as I checked the returned iterators are
Post by Johannes Schmid
compeletely ok.
They presently work in many circumstances, as described above.
Post by Johannes Schmid
Of course, the get_position() method of those iterators will return a
different value as get_position() of a gtksourceview iterator, but
get_position() is editor specific anyway.
I don't understand this. There should be no difference between char-positions reported by the different editors for a given piece of text.

That's why we want to get rid
Post by Johannes Schmid
of position and just use the iterators.
BTW, I'm now planning to reduce the risk for s/r, by abandoning editor-native searching altogether, and reverting to the local searching that was always used before. However, the issue under discussion must still be addressed, because there's no avoiding the editors' selection and marking processes, and they have the same characteristics.

Regards
Tom

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Johannes Schmid
2008-01-13 12:09:28 UTC
Permalink
Hi Tom!
Post by t***@onepost.net
Post by Johannes Schmid
Of course, the get_position() method of those iterators will return a
different value as get_position() of a gtksourceview iterator, but
get_position() is editor specific anyway.
I don't understand this. There should be no difference between char-positions reported by the different editors for a given piece of text.
I think this is the main reason for our misunderstanding. Iterators
represent a position that is save to be used and that is always between
two utf-8 characters as long as you use the iterator as is or with
next/prev() and never use it's internal "position".

The "position" reported by the iterator is something that is just some
internal representation and in theorie entirely private to the iterator.
So using set/get_position is just an ugly migration path and we should
in the end completely avoid it.

But I agree that there is a point where we cannot avoid it. That is,
when we get a byte/char-position from some external source (e.g a file)
and want to display this in the editor. That would mean we have to do
some conversion and properly get/set_position() in scintilla should take
care of that even if it performs bad. It should not be used often.

Regards,
Johannes
Johannes Schmid
2008-01-13 15:43:29 UTC
Permalink
Hi!
* Once (and if there's a selection, another two-times) in each forward or backward search, to get the position from which to start.
* Twice to get a text-range whenever searching in a selection or function or block.
* If editor-native searching is used, twice to set the search range and twice to get the match range.
* Twice to select each match, if that's the search operation.
* Twice for each skipped replacement or four times for each confirmed replacement.
* Twice for each marked match.
Cannot follow you here, sorry. For me, using native-editor searching,
the following algorithm is used:

- ianjuta_editor_search_forward(): takes two iters, either from
ianjuta_editor_selection_get_start/end() or whole document (NULL, NULL)
or current cursor position (ianjuta_editor_get_position_iter()) => No
conversion
- ianjuta_editor_selection_select(): takes the iters returned by
search_forward() => No conversion
- ianjuta_editor_selection_replace() if appropriate which does not need
any position information
- might be repeated

Where do you need any C=>B or B=>C conversion here.
ianjuta_iterable_set/get_position() is not called once here.

I only see a problem, when we search directly on a file and want to set
the returned position (e.g by pcre) in the editor.

Regards,
Johannes
t***@onepost.net
2008-01-17 22:27:08 UTC
Permalink
On Sun, 13 Jan 2008 16:43:29 +0100
Post by Johannes Schmid
* Once (and if there's a selection, another two-times) in each forward or backward search, to get the position from which to start.
* Twice to get a text-range whenever searching in a selection or function or block.
* If editor-native searching is used, twice to set the search range and twice to get the match range.
* Twice to select each match, if that's the search operation.
* Twice for each skipped replacement or four times for each confirmed replacement.
* Twice for each marked match.
Cannot follow you here, sorry. For me, using native-editor searching,
- ianjuta_editor_search_forward(): takes two iters, either from
ianjuta_editor_selection_get_start/end() or whole document (NULL, NULL)
or current cursor position (ianjuta_editor_get_position_iter()) => No
conversion
- ianjuta_editor_selection_select(): takes the iters returned by
search_forward() => No conversion
- ianjuta_editor_selection_replace() if appropriate which does not need
any position information
- might be repeated
Where do you need any C=>B or B=>C conversion here.
ianjuta_iterable_set/get_position() is not called once here.
I only see a problem, when we search directly on a file and want to set
the returned position (e.g by pcre) in the editor.
Yes indeed, it's quite ok to grab a pair of pre-cooked iters, either from the several reportable positions (current, selection, start, end) or from one of the current search functions, and merely hand those iters over for processing without care about what they contain. That is, not much is being done - as in a search-box.

But more generally, it's not so simple. In any repeated operation (say, bookmark all matches) we need to reset at least one end of the search range each time, and since the backend doesn't know which one, both ends will need to be reset for each repetition. For any search where we want to do something based on what was found (in a regex search) or where it was found (e.g. get whole line of text for that point), we need to get position-info from the search function if the "something" can't be done by the editor (or by the editor interface) directly.

I was thinking more about a previous comment, along the lines of iter position would ideally not be set by downstream code, instead it would only be set by an editor. I'd forgotten about gtk_text_iter_set_offset(). So I'm reminded that it's acceptable for code to set a position directly. Based on our discussion about what an anjuta iter contains, ianjuta_iterable_set_position() should exactly reflect gtk_text_iter_set_offset() for all editors. With character-position, and conversion(s) of that when used with most scintilla functions and some in-search functions. And in that case, I still want a work-around ...

Regards
Tom

-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Johannes Schmid
2008-01-17 23:30:56 UTC
Permalink
Hi all!
Post by t***@onepost.net
I was thinking more about a previous comment, along the lines of iter position would ideally not be set by downstream code, instead it would only be set by an editor. I'd forgotten about gtk_text_iter_set_offset(). So I'm reminded that it's acceptable for code to set a position directly. Based on our discussion about what an anjuta iter contains, ianjuta_iterable_set_position() should exactly reflect gtk_text_iter_set_offset() for all editors. With character-position, and conversion(s) of that when used with most scintilla functions and some in-search functions. And in that case, I still want a work-around ...
Naba, could you please take care that set/get_position() is implemented
correctly for scintilla? You can maybe use
g_utf8_offset_to_pointer/pointer_to_offset in that case. IMHO this is
the only clean solution for positions.

Regards,
Johannes
Naba Kumar
2008-01-18 16:23:14 UTC
Permalink
Hi,
Post by Johannes Schmid
Hi all!
Post by t***@onepost.net
I was thinking more about a previous comment, along the lines of
iter position would ideally not be set by downstream code, instead it
would only be set by an editor. I'd forgotten about
gtk_text_iter_set_offset(). So I'm reminded that it's acceptable for
code to set a position directly. Based on our discussion about what an
anjuta iter contains, ianjuta_iterable_set_position() should exactly
reflect gtk_text_iter_set_offset() for all editors. With
character-position, and conversion(s) of that when used with most
scintilla functions and some in-search functions. And in that case, I
still want a work-around ...
Current ianjuta_iterable_get/set() in scintilla takes/returns the
correct utf8 offset already. It's not efficient but it tries to return
accurately. What else did you mean?
Post by Johannes Schmid
Naba, could you please take care that set/get_position() is
implemented
correctly for scintilla? You can maybe use
g_utf8_offset_to_pointer/pointer_to_offset in that case. IMHO this is
the only clean solution for positions.
Tt would require copying full buffer every time (we already do that
partially), since scintilla does not provide raw access to its buffer. I
am not very confident this would solve the problem entirely.

What I have been trying to come up is to have similar conversion API for
scintilla to avoid copying anything and to avoid iterating it from
outside. It makes it a tad bit better.

Thanks.

Regards,
-Naba



-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
t***@onepost.net
2008-01-19 02:10:04 UTC
Permalink
On Fri, 18 Jan 2008 18:23:14 +0200
Naba Kumar <***@gmail.com> wrote:

Some thoughts, Naba.

These are based on svn content from a while back, so if things have changed, please ignore ...

iiter_set_position() has that horrid iter_next loop. I can't see why not just set the position directly with SCI_POSITIONBEFORE (for position 0) or SCI_POSITIONAFTER, and check to see that the resultant value is what was wanted. But I've never checked what happens if you were to try to set a position past the end of text. IIRC it hates positions < 0.

iiter_get_position() grabs text and converts it using g_utf8_strlen(), implying that the position stored in the iter is bytes, not chars. It should just return cell->priv->position.

iiter_last() ultimately uses SCI_GETLENGTH, which is bytes (BTW, after investigation I've concluded that in the scintilla documentation "number of characters" means "number of bytes" unless there is specific reference to the function taking account of code-page).

Otherwise, Johannes' point really relates to all the editor-interface functions which use SCI_* position-funcs other than SCI_POSITIONBEFORE/AFTER (i.e. all except iiter_set_position/first/next/previous). As you've mentioned, in each case, you'd want a text-dump plus g_utf8_strlen() or g_utf8_pointer_to_offset() to populate a returned iter from scintilla value, or for data supplied to the SCI function from an iter, text dump then (g_utf8_offset_to_pointer - dump). Sucks.

Regards
Tom
Post by Naba Kumar
Post by Johannes Schmid
Hi all!
Post by t***@onepost.net
I was thinking more about a previous comment, along the lines of
iter position would ideally not be set by downstream code, instead it
would only be set by an editor. I'd forgotten about
gtk_text_iter_set_offset(). So I'm reminded that it's acceptable for
code to set a position directly. Based on our discussion about what an
anjuta iter contains, ianjuta_iterable_set_position() should exactly
reflect gtk_text_iter_set_offset() for all editors. With
character-position, and conversion(s) of that when used with most
scintilla functions and some in-search functions. And in that case, I
still want a work-around ...
Current ianjuta_iterable_get/set() in scintilla takes/returns the
correct utf8 offset already. It's not efficient but it tries to return
accurately. What else did you mean?
Post by Johannes Schmid
Naba, could you please take care that set/get_position() is
implemented
correctly for scintilla? You can maybe use
g_utf8_offset_to_pointer/pointer_to_offset in that case. IMHO this is
the only clean solution for positions.
Tt would require copying full buffer every time (we already do that
partially), since scintilla does not provide raw access to its buffer. I
am not very confident this would solve the problem entirely.
What I have been trying to come up is to have similar conversion API for
scintilla to avoid copying anything and to avoid iterating it from
outside. It makes it a tad bit better.
Thanks.
Regards,
-Naba
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
_______________________________________________
Anjuta-devel mailing list
https://lists.sourceforge.net/lists/listinfo/anjuta-devel
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/

Naba Kumar
2008-01-14 08:07:34 UTC
Permalink
Hi,
Post by Johannes Schmid
That would mean we have to do
some conversion and properly get/set_position() in scintilla should take
care of that even if it performs bad. It should not be used often.
That's the intention. If it's not done yet, that's because I was lazy.

Thanks.

Regards,
-Naba



-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Johannes Schmid
2008-01-15 18:24:33 UTC
Permalink
Hi!

Well, lets keep it to the point that we need a working search for 2.4
and that the current code has some problems (including some crashers).

Tom, do you have a working patch somewhere in your queue, no matter how
it internally works for now? I really like the cleaned-up search dialog
but it has to go in before UI freeze.

Thanks,
Johannes
Post by Naba Kumar
Hi,
Post by Johannes Schmid
That would mean we have to do
some conversion and properly get/set_position() in scintilla should take
care of that even if it performs bad. It should not be used often.
That's the intention. If it's not done yet, that's because I was lazy.
Thanks.
Regards,
-Naba
t***@onepost.net
2008-01-17 21:32:18 UTC
Permalink
On Tue, 15 Jan 2008 19:24:33 +0100
Post by Johannes Schmid
Well, lets keep it to the point that we need a working search for 2.4
and that the current code has some problems (including some crashers).
Tom, do you have a working patch somewhere in your queue, no matter how
it internally works for now? I really like the cleaned-up search dialog
but it has to go in before UI freeze.
Really, the code now looks sufficiently different that a patch would be bigger than dropping in the replacement code.

I've been testing as I go, but am only now starting on the "replace in all project files" stuff.

I could send the .glade file, it probably wouldn't work though, due to different callbacks.

PROGRESS REPORT In recent days I have:

* refactored all the s/r front-end code to virtually eliminate the use of static data, mainly to allow usage of any/multiple s/r data struct or object, but it would also now be trivial to have multiple s/r dialogs if wanted (say, with multiple editor windows)
* added an action to clear all match-marks - previously this was only possible by searching for and marking text that's not there
* significantly refreshed the "saved searches" code - to make it work :); enable it to log and apply search and/or replace strings; to make it easier to apply and modify saved searches by allowing the saved search tab to be seen and used (in a dropped window) at the same time as the prefs tab; de-asciify all string comparisons; plug about 30 leaks.

Now coming to the end of the hard stuff. The last piece of the mosaic is dealing with s/r in "not opened files". I'll post on that separately.

BTW, deactivate_plugin() in the s/r plugin does not get called upon plugin unload/session end. That's not a big deal before, but now I'd like to clean up the non-static data. How do I fix this ?

Regards
Tom
Post by Johannes Schmid
Post by Naba Kumar
Hi,
Post by Johannes Schmid
That would mean we have to do
some conversion and properly get/set_position() in scintilla should take
care of that even if it performs bad. It should not be used often.
That's the intention. If it's not done yet, that's because I was lazy.
Thanks.
Regards,
-Naba
-------------------------------------------------------------------------
This SF.net email is sponsored by: Microsoft
Defy all challenges. Microsoft(R) Visual Studio 2008.
http://clk.atdmt.com/MRT/go/vse0120000070mrt/direct/01/
Naba Kumar
2008-01-13 20:45:37 UTC
Permalink
Hi Tom,
Post by t***@onepost.net
Johannes, there still seems to be misunderstanding in this discussion.
Okay. Assuming we all understand more or less how the interface is going
to work both internally and externally, let's end the discussion on
this. Here are the facts put together:

1) 'Advanced search' is not going to be easy using iter interface, if
not impossible.

2) Having integer API in addition to iter API is one way to fix it, but
this is agreeably not ideal as it compromises the sanity of libajuta
API. It is important for us to maintain consistency and sanity of the
API because it's a critical point of Anjuta.

3) We need to come up with an alternative solution to implement 'advance
search'. I am sure something can be worked out, good or bad. As long as
the bad parts remains confined within the plugins themselves, we should
be fine.

Thanks.

Regards,
-Naba



-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
t***@onepost.net
2008-01-11 00:37:39 UTC
Permalink
On Wed, 09 Jan 2008 11:05:23 +0200
Naba Kumar <***@gmail.com> wrote:

Hi Naba,
Post by Naba Kumar
Now, I don't really now how the algorithm works, but can you confirm if
it is doable using iters instead of any integer indexes? In theory, it
should be possible although it would a bit more involved.
What I'm calling "advanced" search is characterised by pre-analysis of the wanted string ("needle"), to determine rule(s) on how far to skip whenever a mis-match between the needle and the target string ("haystack") is found.

There are some implications:

* Between skips, it does byte-by-byte comparison of needle and haystack. I guess the process could be adapted to compare and skip whole characters, but IMO the complexity cost is not worth doing this. Not to mention figuring out how to adapt the pre-analysis and skip rules.

* ATM I think it's not worth trying to use it for case-insensitive searching. I may relent on that.

* The skip rules impose a minimum practical needle length, say 4 bytes.

* Given the pre-analysis overhead, there's a minimum practical haystack length, ATM I use 200 bytes.

* Some algorithms don't work in backwards direction.

So, even without regex searcning, there would always be a role for some sort of "simple" search, which might use a local buffer or use editor-native searching. ATM I'm evaluating the latter, it seems ok but not yet tested for speed with lots of files and lots of matches.

I've coded Knuth-Morris-Pratt algorithm for advanced search, but I intend to replace that. Still not exactly sure what is most appropriate for anj.

Regards
Tom

-------------------------------------------------------------------------
Check out the new SourceForge.net Marketplace.
It's the best place to buy or sell services for
just about anything Open Source.
http://ad.doubleclick.net/clk;164216239;13503038;w?http://sf.net/marketplace
Loading...