Discussion:
stack ws workspace for pcre and others #1576
Nils Goroll
2014-09-02 17:12:33 UTC
Permalink
Hi,

on the weekend I have actually stumbled over the same cause that bug #1576 has
(but unfortunately have not immediately realized the connection) and have
changed our dcs_classifier vmod (which needs some reasonably large scratch
memory in the order of ~64k) to use the client workspace rather than stack.

Using the workspace seemed more reasonable as varnish core structures have been
moved away from the stack and I thought it would be more advantagous to increase
a workspace size than increase the stack size (which has a nice tiny default of
48k now).
This is actually related to the thread_pool_stack size.
Bumping the value to 49k make the test work. Lowering pcre_match_limit
makes the match fail without crashing.
Should we reduce the pcre_match_limit to something more sensible? 10000
seems like a rather large value.
If we were to need a larger default stack for pcre anyway, it wouldn't make
sense to move scratch space to a WS.

On the other hand, we might as well consider to give pcre space from workspaces:

PCRESTACK(3):

Compiling PCRE to use heap instead of stack for pcre[16]_exec()

In environments where stack memory is constrained, you might want to
compile PCRE to use heap memory instead of stack for remember‐
ing back-up points when pcre[16]_exec() is running. This makes it run
a lot more slowly, however. Details of how to do this are
given in the pcrebuild documentation. When built in this way, instead of
using the stack, PCRE obtains and frees memory by calling
the functions that are pointed to by the pcre[16]_stack_malloc and
pcre[16]_stack_free variables. By default, these point to mal‐
loc() and free(), but you can replace the pointers to cause PCRE to use
your own functions. Since the block sizes are always the
same, and are always freed in reverse order, it may be possible to
implement customized memory handlers that are more efficient
than the standard functions.

So we could simply hand the free space of some ws to pcre via some
pcre_stack_(malloc|free).


Whatever we do, we should either try to minimize ws usage or stack usage, but do
it consequently for all components.

Nils
Federico Schwindt
2014-09-02 17:56:05 UTC
Permalink
Only problem is that it's a build option.

Increasing the stack space will only work as long as the regex doesn't
recurse more than the stack space allows.

Not sure what's the best solution here but we should avoid crashing.
Post by Nils Goroll
Hi,
on the weekend I have actually stumbled over the same cause that bug #1576 has
(but unfortunately have not immediately realized the connection) and have
changed our dcs_classifier vmod (which needs some reasonably large scratch
memory in the order of ~64k) to use the client workspace rather than stack.
Using the workspace seemed more reasonable as varnish core structures have been
moved away from the stack and I thought it would be more advantagous to increase
a workspace size than increase the stack size (which has a nice tiny default of
48k now).
This is actually related to the thread_pool_stack size.
Bumping the value to 49k make the test work. Lowering pcre_match_limit
makes the match fail without crashing.
Should we reduce the pcre_match_limit to something more sensible? 10000
seems like a rather large value.
If we were to need a larger default stack for pcre anyway, it wouldn't make
sense to move scratch space to a WS.
Compiling PCRE to use heap instead of stack for pcre[16]_exec()
In environments where stack memory is constrained, you might want to
compile PCRE to use heap memory instead of stack for remember‐
ing back-up points when pcre[16]_exec() is running. This makes it run
a lot more slowly, however. Details of how to do this are
given in the pcrebuild documentation. When built in this way, instead of
using the stack, PCRE obtains and frees memory by calling
the functions that are pointed to by the pcre[16]_stack_malloc and
pcre[16]_stack_free variables. By default, these point to mal‐
loc() and free(), but you can replace the pointers to cause PCRE to use
your own functions. Since the block sizes are always the
same, and are always freed in reverse order, it may be possible to
implement customized memory handlers that are more efficient
than the standard functions.
So we could simply hand the free space of some ws to pcre via some
pcre_stack_(malloc|free).
Whatever we do, we should either try to minimize ws usage or stack usage, but do
it consequently for all components.
Nils
_______________________________________________
varnish-dev mailing list
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev
Nils Goroll
2014-09-02 18:02:13 UTC
Permalink
Post by Federico Schwindt
Only problem is that it's a build option.
Yes. But I don't see this as a disadvantage. I think we need to make a decision.
Federico Schwindt
2014-09-02 18:06:36 UTC
Permalink
What you mean not a disadvantage? Are you by any means suggesting to bundle
pcre with Varnish?

There is also this somewhat worrisome comment in the docs: PCRE runs
noticeably more slowly when built in this way.

I personally don't consider this an option at all.
Post by Nils Goroll
Post by Federico Schwindt
Only problem is that it's a build option.
Yes. But I don't see this as a disadvantage. I think we need to make a decision.
Nils Goroll
2014-09-02 19:24:51 UTC
Permalink
What you mean not a disadvantage? Are you by any means suggesting to bundle pcre
with Varnish?
If it saved considerable amounts of per-thread memory then I'd consider the option.

Or, alternatively, realize that we need a larg-ish stack anyway and (re)adjust
the rest of the code following this insight.
There is also this somewhat worrisome comment in the docs: PCRE runs noticeably
more slowly when built in this way.
I thought the docs were referring to pcre_malloc and pcre_free pointing to
malloc() and free(), respectively and I'd expect the overhead to be not that
significant if we implemented pcre_malloc as moving the free pointer of a
workspace (pcre_free probably could be a noop).
Nils Goroll
2014-09-02 19:34:53 UTC
Permalink
Post by Nils Goroll
Or, alternatively, realize that we need a larg-ish stack anyway
For all those who have not yet read pcrestack(3):

As a very rough rule of thumb, you should reckon on about 500 bytes per
recursion. Thus, if you want to limit your stack usage to
8Mb, you should set the limit at 16000 recursions. A 64Mb stack, on the
other hand, can support around 128000 recursions.

pcre_match_limit_recursion 10000 (default)
thread_pool_stack 48k [bytes] (default)

So these obviously do not fit at all.

We we opt for keeping pcre on the stack, we should consider to auto-tune either
to sensible values.

Nils
Federico Schwindt
2014-09-02 20:10:57 UTC
Permalink
I don't think you're considering all the implications.

Bundling something is no small feat, specially when bugs are found. The
libjemalloc problem is a good example.

Leaving that aside the NO_RECURSIVE code in pcre is not the default. This
code is likely much less exercised than the recursive one and bugs might be
lurking there. FWIW the last entry I can find in the changelog is from Dec
2010.

But most importantly this would be shifting the problem from the stack to
the workspace. This doesn't fix the problem, just makes it easier or
harder, we don't know, to reproduce.
Post by Federico Schwindt
What you mean not a disadvantage? Are you by any means suggesting to
bundle pcre
Post by Federico Schwindt
with Varnish?
If it saved considerable amounts of per-thread memory then I'd consider the option.
Or, alternatively, realize that we need a larg-ish stack anyway and (re)adjust
the rest of the code following this insight.
Post by Federico Schwindt
There is also this somewhat worrisome comment in the docs: PCRE runs
noticeably
Post by Federico Schwindt
more slowly when built in this way.
I thought the docs were referring to pcre_malloc and pcre_free pointing to
malloc() and free(), respectively and I'd expect the overhead to be not that
significant if we implemented pcre_malloc as moving the free pointer of a
workspace (pcre_free probably could be a noop).
Nils Goroll
2014-09-02 20:19:27 UTC
Permalink
Post by Federico Schwindt
I don't think you're considering all the implications.
let me pull this straight: No, I did not mean to say we should pull in pcre. I
only wanted to start a discussion about what is the best option, I have not
thought through all implications. Again, I am not a pcre expert.

I did mean to say though that if we wanted to minimize the stack size, we should
consider options to get memory for pcre from elsewhere.


And at this point it seems like keeping pcre on the stack is the best option, so
if we need a larg-ish stack anyway for pcre, I think we should re-think some of
the code in order to make good use of that (and possibly save memory elsewhere).

Also we definitely need to provide better defaults for
pcre_match_limit_recursion and thread_pool_stack
Poul-Henning Kamp
2014-09-02 20:38:13 UTC
Permalink
--------
Post by Nils Goroll
Post by Federico Schwindt
I don't think you're considering all the implications.
let me pull this straight: No, I did not mean to say we should pull in pcre.
Pulling in PCRE to move things onto the heap doesn't solve the problem.

In fact it will probably increase our per-thread memory foot-print
more than increasing the stack will.

Move on, nothing to see here...
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk-***@public.gmane.org | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Poul-Henning Kamp
2014-09-02 19:45:19 UTC
Permalink
--------

I think moving pcre from stack to heap would hurt our performance
more than increasing the thread stack size.

What we really need is for pcre to tell us an estimate of how much
stack-space is needed during compilation of the REs, so that we
can refuse vcl.use if there isn't enough thread_stack for it.
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk-***@public.gmane.org | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Nils Goroll
2014-09-02 19:50:32 UTC
Permalink
Post by Poul-Henning Kamp
What we really need is for pcre to tell us an estimate of how much
stack-space is needed during compilation
Is this possible? My understanding is that the recursion depth depends on the
data and thus all we can do is set a recursion limit for the stack size we have
available.

Nils
Poul-Henning Kamp
2014-09-02 19:51:49 UTC
Permalink
--------
Post by Nils Goroll
Post by Poul-Henning Kamp
What we really need is for pcre to tell us an estimate of how much
stack-space is needed during compilation
Is this possible? My understanding is that the recursion depth depends on the
data and thus all we can do is set a recursion limit for the stack size we have
available.
You tell me, I'm no PCRE expert...
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk-***@public.gmane.org | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Nils Goroll
2014-09-02 19:53:31 UTC
Permalink
Post by Poul-Henning Kamp
You tell me, I'm no PCRE expert...
Neither am I :(
Nils Goroll
2014-09-02 20:08:20 UTC
Permalink
Post by Nils Goroll
Post by Poul-Henning Kamp
What we really need is for pcre to tell us an estimate of how much
stack-space is needed during compilation
Is this possible? My understanding is that the recursion depth depends on the
data and thus all we can do is set a recursion limit for the stack size we have
available.
This sounds close to the answer we are looking for:

https://lists.exim.org/lurker/message/20130417.155604.5223f8b8.en.html
Post by Nils Goroll
Is there a way to get some "stats" for a regexp like max recursion
depth, taken branches, steps needed for solution, ...?
The pcretest program has a facility for determining the max recursion
depth, given a pattern and a subject string. Grep for \M in the pcretest
man page.

pcretest(1):

If \M is present, pcretest calls pcre_exec() several times, with different
values in the match_limit and match_limit_recursion fields of the pcre_extra
data structure, until it finds the minimum numbers for each parameter that allow
pcre_exec() to complete.
Federico Schwindt
2014-09-02 23:56:32 UTC
Permalink
Fix aside should we consider using sigaltstack to also catch these crashes?
I imagine something like the attached but allocated on startup and shared
among all workers.

This might be specific to pcre at the moment but I can see its usefulness
in the vmod world.
Post by Nils Goroll
Post by Nils Goroll
Post by Poul-Henning Kamp
What we really need is for pcre to tell us an estimate of how much
stack-space is needed during compilation
Is this possible? My understanding is that the recursion depth depends
on the
Post by Nils Goroll
data and thus all we can do is set a recursion limit for the stack size
we have
Post by Nils Goroll
available.
https://lists.exim.org/lurker/message/20130417.155604.5223f8b8.en.html
Post by Nils Goroll
Is there a way to get some "stats" for a regexp like max recursion
depth, taken branches, steps needed for solution, ...?
The pcretest program has a facility for determining the max recursion
depth, given a pattern and a subject string. Grep for \M in the pcretest
man page.
If \M is present, pcretest calls pcre_exec() several times, with different
values in the match_limit and match_limit_recursion fields of the pcre_extra
data structure, until it finds the minimum numbers for each parameter that allow
pcre_exec() to complete.
_______________________________________________
varnish-dev mailing list
https://www.varnish-cache.org/lists/mailman/listinfo/varnish-dev
Poul-Henning Kamp
2014-09-03 08:13:35 UTC
Permalink
--------
In message <CAJV_h0ZfT-BwxxLf-ddz1Lb5C+W0SKvFNkKQS79CDMM9ANdnWA-JsoAwUIsXosN+***@public.gmane.org>
, Federico Schwindt writes:

+#ifdef HAVE_SIGALTSTACK
+ if (cache_param->sigsegv_handler) {
+ char stackbuf[MINSIGSTKSZ];
+ stack_t ss;
+
+ ss.ss_size = sizeof(stackbuf);
+ ss.ss_sp = stackbuf;
+ ss.ss_flags = 0;
+ (void)sigaltstack(&ss, NULL);
+ }
+#endif

This will ad 2-4 KB to all threads.

Also, I'm not convinced about the wisdom of allocating the sigstack
on the normal stack...
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
phk-***@public.gmane.org | TCP/IP since RFC 956
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Federico Schwindt
2014-09-03 15:37:09 UTC
Permalink
As discussed on irc the space can be allocated and shared among all threads
instead of using the stack.
I've tested it this morning and works fine. In both cases the stack trace
was present.

Without this code if we overflow the stack we just get a coredump.

Regardless of this we should lower the pcre limits for the time being based
on the information in pcrestack.
Post by Poul-Henning Kamp
--------
In message <
+#ifdef HAVE_SIGALTSTACK
+ if (cache_param->sigsegv_handler) {
+ char stackbuf[MINSIGSTKSZ];
+ stack_t ss;
+
+ ss.ss_size = sizeof(stackbuf);
+ ss.ss_sp = stackbuf;
+ ss.ss_flags = 0;
+ (void)sigaltstack(&ss, NULL);
+ }
+#endif
This will ad 2-4 KB to all threads.
Also, I'm not convinced about the wisdom of allocating the sigstack
on the normal stack...
--
Poul-Henning Kamp | UNIX since Zilog Zeus 3.20
FreeBSD committer | BSD since 4.3-tahoe
Never attribute to malice what can adequately be explained by incompetence.
Loading...