Discussion:
Problem with cache files being deleted
Kime Philip
2014-12-15 20:15:13 UTC
Permalink
I am the maintainer of biber, a backend for the LaTeX biblatex system, which is distributed using PAR::Packer. This works fine, but we regularly get error reports from users that, after a while of using the tool, a particular file disappears from the cache and the cache has to be deleted and the binary unpacked again. I have never worked out why this happens and it has been a problem for a few years, through many PP versions. You can see the general issue reported here:

http://tex.stackexchange.com/questions/140814/biblatex-biber-fails-with-a-strange-error-about-missing-recode-data-xml-file

From what I have seen, when this file is missing, all of the .pm files in inc/ are missing too, although I cannot reliably reproduce this. It is a very frequent error from many users on various platforms and I’d very much like to find out why it happens …


PK
--
Dr Philip Kime
Roderich Schupp
2014-12-16 10:02:52 UTC
Permalink
...regularly we get an error report from users that after a while using
the tool, a particular file disappears from the cache and the cache has to
be deleted and the binary unpacked again. I have never worked out why this
is and it has been a problem for a few years, through many PP versions. You
http://tex.stackexchange.com/questions/140814/biblatex-biber-fails-with-a-strange-error-about-missing-recode-data-xml-file
From what I have seen, when this file is missing, all of the .pm files in
inc/ are missing too although I cannot reliably reproduce this. It is a
very frequent error from many users on various platforms and I’d very much
like to find out why it happens 

I'm fairly confident that PAR::Packer doesn't meddle with the files in the
cache area of a pp-packed executable. AFAICT there's nothing in the code to
delete files there (except to purge the whole cache area immediately after
running the executable).

At least for *nix installations I can think of an explanation:
administrative cron jobs that clean up /tmp, /var/tmp etc. by purging files
not modified for some time. Dunno for Windows, esp. since the "temp"
directories there are typically specific to each user.

Cheers, Roderich
Kime Philip
2014-12-16 10:06:15 UTC
Permalink
Yes, that’s what I thought, just scheduled /tmp cleanup etc., but in fact it seems to happen more often on non-Unix …
Certainly some cases were due to users Ctrl-Cing a first .exe run because they thought it was taking too long (it is of course normal for it to take longer on the first run of a new version, due to unpacking) and then the cache building was incomplete. However, most reports say that “suddenly” it no longer works and complains about missing files. I can’t really explain it. Deleting the cache and running again always fixes it.

PK
Post by Kime Philip
http://tex.stackexchange.com/questions/140814/biblatex-biber-fails-with-a-strange-error-about-missing-recode-data-xml-file
From what I have seen, when this file is missing, all of the .pm files in inc/ are missing too although I cannot reliably reproduce this. It is a very frequent error from many users on various platforms and I’d very much like to find out why it happens …
I'm fairly confident that PAR::Packer doesn't meddle with the files in the cache area of
a pp-packed executable. AFAICT there's nothing in the code to delete files there (except
to purge the whole cache area immediately after running the executable).
At least for *nix installations I can think of an explanation: administrative cron jobs that clean up
/tmp, /var/tmp etc by purging files not modified for some time. Dunno for Windows, esp.
since the "temp" directories there are typically specific to each user.
Cheers, Roderich
--
Dr Philip Kime
Shawn Laffan
2014-12-16 10:29:56 UTC
Permalink
I had a similar case recently where a user ran CCleaner on the temp
folder while the application was open. It deleted everything in the
cache folder that was not file locked, so the cache folder still exists,
along with several of the component files.

A bit of testing showed the .pm and other perl files are re-extracted
the next time the PAR is run, but none of the files packed using -a (an
icon file and a glade file in my case).

I ended up adding code to the script to manually check for file
existence and, if needed, unpack from the PAR. It's not a solution
that will scale well, but I only have two such files. Example code is
below, but PAR::par_handle is probably simpler now that I look at the docs again.

For the more general case, is there a simple means to trigger the
re-extraction of all files packed using -a?

Regards,
Shawn.



###
require Archive::Zip;
require Path::Class;

# Location the icon should have been extracted to in the cache area.
my $icon     = Path::Class::file ($ENV{PAR_TEMP}, 'inc', 'Biodiverse_icon.ico');
my $icon_str = $icon->stringify;

# Re-extract it from the packed executable if it has gone missing.
if (!-e $icon_str) {
    my $fname = $icon->basename;
    my $zip   = Archive::Zip->new($ENV{PAR_PROGNAME})
        or die "Unable to open $ENV{PAR_PROGNAME}";
    my $success = $zip->extractMember ($fname, $icon_str);
}
Post by Kime Philip
Yes, that’s what I thought, just scheduled /tmp cleanup etc., but in fact it seems to happen more often on non-Unix …
Certainly some cases were due to users Ctrl-Cing a first .exe run because they thought it was taking too long (it is of course normal for it to take longer on the first run of a new version, due to unpacking) and then the cache building was incomplete. However, most reports say that “suddenly” it no longer works and complains about missing files. I can’t really explain it. Deleting the cache and running again always fixes it.
PK
Post by Kime Philip
http://tex.stackexchange.com/questions/140814/biblatex-biber-fails-with-a-strange-error-about-missing-recode-data-xml-file
From what I have seen, when this file is missing, all of the .pm files in inc/ are missing too although I cannot reliably reproduce this. It is a very frequent error from many users on various platforms and I’d very much like to find out why it happens …
I'm fairly confident that PAR::Packer doesn't meddle with the files in the cache area of
a pp-packed executable. AFAICT there's nothing in the code to delete files there (except
to purge the whole cache area immediately after running the executable).
At least for *nix installations I can think of an explanation: administrative cron jobs that clean up
/tmp, /var/tmp etc by purging files not modified for some time. Dunno for Windows, esp.
since the "temp" directories there are typically specific to each user.
Cheers, Roderich
--
Dr Philip Kime
Roderich Schupp
2014-12-16 10:49:24 UTC
Permalink
Post by Shawn Laffan
For the more general case, is there a simple means to trigger the
re-extraction of all files packed using -a?
Simply delete the cache area :)
Sorry, for .pm and "glue" .dll files the check is automatic as PAR
intercepts module loading
and DynaLoader anyway. But all stuff packed with -a is accessed using
open() calls - I wouldn't want PAR to replace CORE::open.
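For illustration, a minimal sketch of the usual access pattern (the file name
is invented, not from any real application):

use File::Spec;
# A -a packed file is read with a plain open() against the cache area;
# PAR has no hook into calls like this.
my $glade = File::Spec->catfile($ENV{PAR_TEMP}, 'inc', 'gui.glade');
open my $fh, '<', $glade or die "Cannot open $glade: $!";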

Cheers, Roderich
Shawn Laffan
2014-12-16 11:16:01 UTC
Permalink
Well, that's definitely one solution :)

I'm not sure how well that would go when done from within a PAR
executable, since that would presumably lock its source files, and thus
the cache will not be deleted. Doing it within the executable is
simplest for the users, many of whom are unfamiliar with the temp folder
and its contents and tend to steer clear.

What I'm thinking of is some way for a PAR executable to detect if extra
files packed using -a are missing, and then unpack them. Handling of dll
and glue files can remain as-is (it seems they are re-extracted in any
case). Extra files could be explicitly checked for in the script and
then something like a PAR::reextract_extra_packed_files sub could
extract them. Possibly it could handle the existence checks as well.
Essentially it is the same approach I gave in my previous email, but
applied across all extra files.

If I get some pointers to the relevant part of the code base then I
could take a stab at it, without any promises of success of course.


Or have I misunderstood your point about open(), and that's how they are
currently extracted?

Regards,
Shawn.
On Tue, Dec 16, 2014 at 11:29 AM, Shawn Laffan
For the more general case, is there a simple means to trigger the
re-extraction of all files packed using -a?
Simply delete the cache area :)
Sorry, for .pm and "glue" .dll files the check is automatic as PAR
intercepts module loading
and DynaLoader anyway. But all stuff packed with -a is accessed using
open() calls - I wouldn't want PAR to replace CORE::open.
Cheers, Roderich
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis

UNSW CRICOS Provider Code 00098G
Roderich Schupp
2014-12-16 13:04:39 UTC
Permalink
Post by Shawn Laffan
What I'm thinking of is some way for a PAR executable to detect if extra
files packed using -a are missing, and then unpack them.
Such a check would defeat the "if the cache area exists, skip extracting
stuff" logic, so you'll
take a hit when starting up. Also there's a (somewhat crude) lock against
concurrent
unpacking of the cache area that you have to keep in mind.

But most of all: the code base is very brittle and prone to breaking stuff
at a distance, hence I myself won't touch certain parts unless it's
absolutely necessary. But if you want to try your hand at hacking: the code
that extracts all packed files (except DLLs) into <cache area>/inc is
sub _extract_inc in PAR.pm

Cheers, Roderich
Shawn Laffan
2014-12-16 20:27:49 UTC
Permalink
Thanks Roderich.

Given the brittleness I'll aim for a separate sub that would be
explicitly called by the packed script.

Now to find the time to do it...

Regards,
Shawn.
On Tue, Dec 16, 2014 at 12:16 PM, Shawn Laffan
What I'm thinking of is some way for a PAR executable to detect if
extra files packed using -a are missing, and then unpack them.
Such a check would defeat the "if the cache area exists, skip
extracting stuff" logic, so you'll
take a hit when starting up. Also there's a (somewhat crude) lock
against concurrent
unpacking of the cache area that you have to keep in mind.
But most of all: the code base is very brittle and prone to breaking
stuff at a distance,
hence I myself won't touch certain parts unless it's absolutely necessary.
But if you want to try your hand at hacking: the code that extracts
all packed files
(except DLLs) into <cache area>/inc is sub _extract_inc in PAR.pm
Cheers, Roderich
Shawn Laffan
2014-12-16 20:29:07 UTC
Permalink
Thanks Ron,

The issue is that these files are usually not additional libraries. They
are typically extra files for icons, config, data, and the like. That
said, they don't tend to have .pm or .pl extensions so maybe this could
work.

I'll keep it in mind as I try the explicit approach.

Regards,
Shawn.
On Tue, Dec 16, 2014 at 6:16 AM, Shawn Laffan
What I'm thinking of is some way for a PAR executable to detect if
extra files packed using -a are missing, and then unpack them.
I just had a thought for a work-around.
When pp script.pl is run, the script is "statically
scanned" for dependencies.
When pp -c script.pl is run, after the static scan,
the script is compiled (like perl -c script.pl). I
am thinking that %INC is then inspected for additional libraries.
So, my idea is, after identifying all the extra files to be included
with -a, go back to the main .pl file and add a BEGIN block that adds
those library names to %INC. Then run "pp -c main.pl"
- without any -a options - then verify the created executable runs
correctly.
As I said, I see this as a work-around.
A better solution would be for pp to save the names of all the files
included via -a just as it saves the ones it detects.
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis

UNSW CRICOS Provider Code 00098G
Shawn Laffan
2014-12-16 21:35:26 UTC
Permalink
On Tue, Dec 16, 2014 at 3:29 PM, Shawn Laffan
The issue is that these files are usually not additional
libraries. They are typically extra files for icons, config,
data, and the like. That said, they don't tend to have .pm or .pl
extensions so maybe this could work.
Yes, I've had to use -a for icons, config files, etc. The new thing
(to me) is the partial removal of cached files.
It just occurred to me that %INC was the obvious variable that would
get updated by "perl -c script.pl". Of course, pp's
scanner could be adding its own "instrumentation", but if that's the
case, should not be too hard to see how that works. And then determine
if a BEGIN block in the script being "scanned" can safely insert
additional files.
I did do an experiment. I added a BEGIN block to an existing script to
add an arbitrary file to %INC. As best I can determine, Perl did not
care. The script ran normally, including all the modules it uses.
I will experiment more. If this does work, then the script will
contain the list of other resources used both for generating the
executable and for re-extracting when necessary, at run time.
PAR does add itself to @INC at run time (see example below), so maybe
adding a packed data folder to %INC, or to the end of @INC to avoid name
clashes, in a BEGIN block in the script would be a viable workaround.
Please do post the results.

(BTW, your replies aren't going to the list).


@INC when running a par exe:
C:\user\svn\bd_releases\biodiverse_0.99_007_win64\lib CODE(0x3289f10)
C:\Users\user\AppData\Local\Temp\par-736861776e\cache-b078428bc6c4b872664f757140eea943a5fb10a3\inc\lib
C:\Users\user\AppData\Local\Temp\par-736861776e\cache-b078428bc6c4b872664f757140eea943a5fb10a3\inc
CODE(0x2f57670) CODE(0x2f57bc8)

Regards,
Shawn.
Roderich Schupp
2014-12-17 09:12:22 UTC
Permalink
clashes, in a BEGIN block the script would be a viable workaround. Please
do post the results.
That won't work. @INC is relevant for _loading modules_, i.e. "require
something" (or "use something", as this implies the former). PAR does
indeed add itself to @INC, by adding some directories in the cache area and
by adding a CODE ref that can load modules from the packed executable (the
zip part) itself.
But all icons, data files etc are read by the modules themselves using
standard file reading ops.
PAR cannot detect this (and hence cannot intercept it). Depending on how these
modules determine the filesystem location of their data (unfortunately
there's no universally adopted approach here) these calls _may_
succeed if the data files have been explicitly packed (with -a).
Module::ScanDeps also cannot detect this data when it parses your script
and all required modules.
It has a list of well-known modules and their data, though; for example, all
the Unicode data is packed and extracted to the correct location.

Cheers, Roderich
Shawn Laffan
2014-12-17 23:18:44 UTC
Permalink
Thanks for clarifying Roderich.

I've had a look at sub PAR::_extract_inc and have a plan for a separate
sub which a script can call to re-extract all the files packed using -a.

It boils down to checking for par_tmp/inc/MANIFEST and unzipping it if
needed. None of the -a packed files are listed in the manifest, so we
can then iterate over the inc directory in the zip archive and, if a
file or folder is not in the manifest and is not in par_tmp/inc, then it
can be re-extracted.
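In rough, untested Perl the re-extraction part might look something like the
sketch below (it assumes MANIFEST lists one member path per line, which I
still need to verify):

###
use Archive::Zip qw( :ERROR_CODES );
use File::Spec;

my $inc      = File::Spec->catdir($ENV{PAR_TEMP}, 'inc');
my $manifest = File::Spec->catfile($inc, 'MANIFEST');

open my $mfh, '<', $manifest or die "Cannot read $manifest: $!";
chomp (my @listed = <$mfh>);
my %in_manifest = map { $_ => 1 } @listed;

my $zip = Archive::Zip->new($ENV{PAR_PROGNAME})
    or die "Unable to open $ENV{PAR_PROGNAME}";

for my $member ($zip->memberNames) {
    next if $member =~ m{/\z};            # skip directory entries
    next if $in_manifest{$member};        # packed and tracked normally
    my $target = File::Spec->catfile($inc, $member);
    next if -e $target;                   # still present in the cache
    $zip->extractMember($member, $target) == AZ_OK
        or warn "Could not re-extract $member";
}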

If this passes the sanity check then I'll code something up.


Regards,
Shawn.
On Tue, Dec 16, 2014 at 10:35 PM, Shawn Laffan
avoid name clashes, in a BEGIN block the script would be a viable
workaround. Please do post the results.
something" (or "use something", as this impies the former). PAR does
area and by adding a CODE ref that can load modules from the packed
executable (the zip part) itself.
But all icons, data files etc are read by the modules themselves using
standard file reading ops.
PAR cannot detect this (and hence intercept this). Depending on how
these modules determine the filesystem location of their data
(unfortunately there's no universally adopted approach here) these
calls _may_
succeed if the data files have been explicitly packed (with -a).
Module::ScanDeps also cannot detect this data when it parses your
script and all required modules.
It has a list of well-known modules and their data, though; for example,
all the Unicode data is packed and extracted to the correct location.
Cheers, Roderich
Roderich Schupp
2014-12-18 07:39:40 UTC
Permalink
Post by Shawn Laffan
It boils down to checking for par_tmp/inc/MANIFEST and unzipping it if
needed. None of the -a packed files are listed in the manifest,
Ooops, that might be considered a bug...

Cheers, Roderich
Roderich Schupp
2014-12-18 08:02:47 UTC
Permalink
Here are two other ideas for a workaround for "some cleanup program purged
some files in my cache area".

The rearguard hack:

- make sure extracted files get the timestamp of extraction, not the
timestamp recorded in the zip archive
- add a dummy file, say REARGUARD.txt, at the top of the cache area
- extract it as the first file (so that it becomes the oldest extracted
file)
- at packed executable startup, check that REARGUARD.txt exists,
otherwise re-extract all files
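In Perl terms the startup check would be no more than something like this
(untested sketch; the file name and the actual re-extraction call are
placeholders, not PAR internals):

use File::Spec;

# Extraction would also need to call utime(undef, undef, $file) on each
# extracted file so that age-based cleaners see the time of extraction
# rather than the timestamp stored in the zip.
my $rearguard = File::Spec->catfile($ENV{PAR_TEMP}, 'REARGUARD.txt');
if (!-e $rearguard) {
    # the oldest extracted file is gone, so some cleaner has probably
    # been through the cache area: re-extract everything
    warn "cache area looks incomplete, re-extracting\n";
    # ... call the normal extraction code here ...
}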

Different location for the cache area:

- when the packed executable foo.exe is installed as /some/path/foo.exe,
extract to /some/path/foo.unpacked if this directory can be created
- otherwise fall back to current behaviour, ie. extract to $TMP/par-USER
- I'm not sure about the security implications

Cheers, Roderich
Shawn Laffan
2014-12-18 21:22:52 UTC
Permalink
The rearguard idea is appealing.

The extractions currently use Archive::Unzip::Burst as the first
attempt. I don't know its behaviour, but I assume that if some files
exist and are locked then it will fail. In that event the approach
switches to iterating over $zip->memberNames, so it will still work.
(Actually, Archive::Unzip::Burst seems not to install on Windows, so the
latter approach will be the norm on that OS).

Of course, this won't fix the issues where files are cleaned up while a
PAR process is running, so some sort of API sub would still be useful to
allow scripts to re-extract if file opens fail. This would not be
simple to apply where file opens are called deep inside third-party
modules, but it would still benefit the simpler cases.


WRT the different cache areas, PAR::_extract_inc currently spends up to
10 seconds trying to create a lock file, so that line needs to be
modified. Adding a check for -w on the target temp location should be
enough to avoid that when the exe file is in a non-writable directory.

Apart from possible security issues, this approach could also lead to
multiple par temp folders which could confuse users. Of course, being
"up front" about files could be a useful thing to do.


Regards,
Shawn.
Post by Roderich Schupp
Here are two other ideas for a workaround for "some cleanup program
purged some files in my cache area".
* make sure extracted files get the timestamp of extraction, not the
timestamp recorded in the zip archive
* add a dummy file, say REARGUARD.txt, at the top of the cache area
* extract it as the first file (so that it becomes the oldest
extracted file)
* at packed executable startup, check that REARGUARD.txt exists,
otherwise re-extract all files
* when the packed executable foo.exe is installed as
/some/path/foo.exe, extract to /some/path/foo.unpacked if this
directory can be created
* otherwise fall back to current behaviour, ie. extract to $TMP/par-USER
* I'm not sure about the security implications
Cheers, Roderich
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis

UNSW CRICOS Provider Code 00098G
Roderich Schupp
2014-12-19 11:57:17 UTC
Permalink
The extractions currently use Archive::Unzip::Burst as the first attempt.
I don't know its behaviour, but I assume that if some files exist and are
locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path
(using Archive::Zip).
In that event the approach switches to iterating over $zip->memberNames,
so it will still work. (Actually, Archive::Unzip::Burst seems not to
install on Windows, so the latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper around
InfoZip's unzip.

Of course, this won't fix the issues where files are cleaned up while a PAR
process is running,
Obviously we will only re-extract *missing* files, so no problem with
locked files here.
so some sort of API sub would still be useful to allow scripts to
re-extract if file opens fail.
That would imply that you are able to know when opening a file from "deep
inside third party modules" fails - how?

WRT the different cache areas, PAR::_extract_inc currently spends up to 10
seconds trying to create a lock file, so that line needs to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because
creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into play
in the contended case.
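Schematically it boils down to something like this (simplified sketch, not
the actual PAR code):

use File::Spec;

my $inc   = File::Spec->catdir($ENV{PAR_TEMP}, 'inc');
my $lock  = "$inc.lock";
my $tries = 0;
until (mkdir $lock) {      # atomic create-if-absent, safe even on network filesystems
    die "Cannot obtain lock $lock: $!" if ++$tries >= 10;
    sleep 1;               # only reached in the contended case
}
# ... extract files into $inc ...
rmdir $lock or warn "Cannot remove lock $lock: $!";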

Adding a check for -w on the target temp location should be enough to avoid
that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by construction.

Cheers, Roderich
Shawn Laffan
2014-12-20 02:52:13 UTC
Permalink
Thanks for the clarifications about Archive::Unzip::Burst and the lock
process.


The example I was (unclearly) referring to is one where the PAR-packed
script tries to re-access packed files which were purged by a clean-up
while the PAR process is running. An example from my use case is
when a now-deleted glade file is re-accessed to initialise some new
widgets, which would cause exceptions in the application.

There is nothing PAR can do automatically in this case, nor should
it. It is best left to the packed script and its modules to trigger a
re-extraction, or to lock these files in the first place. The developer
packing the script can easily adjust their own code to allow for these
cases, but adding such re-extraction to dependencies maintained
by others would be more involved or simply unlikely. Any PAR-automated
approach would be fraught and probably fragile. I think this last point
relates back to your earlier comment about not wanting PAR to override
CORE::open, with which I agree.


Regards,
Shawn.
On Thu, Dec 18, 2014 at 10:22 PM, Shawn Laffan
The extractions currently use Archive::Unzip::Burst as the first
attempt. I don't know its behaviour, but I assume that if some
files exist and are locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path
(using Archive::Zip).
In that event the approach switches to iterating over
$zip->memberNames, so it will still work. (Actually,
Archive::Unzip::Burst seems not to install on Windows, so the
latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper around
InfoZip's unzip.
Of course, this won't fix the issues where files are cleaned up
while a PAR process is running,
Obviously we will only re-extract *missing* files, so no problem with
locked files here.
so some sort of API sub would still be useful to allow scripts to
re-extract if file opens fail.
That would imply that you are able to know when opening a file from
"deep inside third party modules" fails - how?
WRT the different cache areas, PAR::_extract_inc currently spends
up to 10 seconds trying to create a lock file, so that line needs
to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because
creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into
play in the contended case.
Adding a check for -w on the target temp location should be enough
to avoid that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by
construction.
Cheers, Roderich
Shawn Laffan
2014-12-21 10:05:33 UTC
Permalink
I've had a go at implementing the rearguard approach, albeit with the
working name PAR_CANARY.txt instead of REARGUARD.txt.

It's still in progress, but if people would like to comment or critique
then changes are at https://github.com/shawnlaffan/perl_par_reinstate_cache

Some key points are:

1. The approach I have taken is to add a method
PAR::Packer::_add_canary_file, but possibly it should simply be appended
to the -a file list.

2. Name clashes are unlikely, but perhaps the canary file name needs to
be a bit more unique, or maybe it can be located in the script folder
instead of the top level of the inc dir.

3. There also needs to be a method to access the canary file name as it
is currently hard coded in both PAR::Packer and PAR.pm, but I'm not sure
where best to locate that.

4. The test file is under the Par-Packer folder,
t/21-pp_reinstate_cached_files.t. It could perhaps later be merged into
20-pp.t since it copies large amounts of code from that file. It needs
"perl Makefile.PL; make" to have been run so it can access the relevant
files.


I also included the added files in the manifest under
https://github.com/shawnlaffan/perl_par_reinstate_cache/commit/6cad3a0562aca287834a8b926ce2ed3f9b7552e1,
assuming that was a bug as Roderich suggested.



Regards,
Shawn.
On Thu, Dec 18, 2014 at 10:22 PM, Shawn Laffan
The extractions currently use Archive::Unzip::Burst as the first
attempt. I don't know its behaviour, but I assume that if some
files exist and are locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path
(using Archive::Zip).
In that event the approach switches to iterating over
$zip->memberNames, so it will still work. (Actually,
Archive::Unzip::Burst seems not to install on Windows, so the
latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper around
InfoZip's unzip.
Of course, this won't fix the issues where files are cleaned up
while a PAR process is running,
Obviously we will only re-extract *missing* files, so no problem with
locked files here.
so some sort of API sub would still be useful to allow scripts to
re-extract if file opens fail.
That would imply that you are able to know when opening a file from
"deep inside third party modules" fails - how?
WRT the different cache areas, PAR::_extract_inc currently spends
up to 10 seconds trying to create a lock file, so that line needs
to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because
creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into
play in the contended case.
Adding a check for -w on the target temp location should be enough
to avoid that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by
construction.
Cheers, Roderich
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis

UNSW CRICOS Provider Code 00098G
Shawn Laffan
2015-01-20 05:58:46 UTC
Permalink
A candidate patch for this issue is now ready for review.

https://github.com/shawnlaffan/perl_par_reinstate_cache/blob/master/canary.patch

Key points are more or less as per my previous email (below) except that
I have added a sub called PAR::get_canary_file_name to avoid any hard
coding. I have also decreased the verbosity threshold for packing files
into /inc, as previously they needed double verbosity flags in pp to
trigger.

Thoughts?


Regards,
Shawn.
Post by Shawn Laffan
I've had a go at implementing the rearguard approach, albeit with the
working name PAR_CANARY.txt instead of REARGUARD.txt.
It's still in progress, but if people would like to comment or
critique then changes are at
https://github.com/shawnlaffan/perl_par_reinstate_cache
1. The approach I have taken is to add a method
PAR::Packer::_add_canary_file, but possibly it should simply be
appended to the -a file list.
2. Name clashes are unlikely, but perhaps the canary file name needs
to be a bit more unique, or maybe it can be located in the script
folder instead of the top level of the inc dir.
3. There also needs to be a method to access the canary file name as
it is currently hard coded in both PAR::Packer and PAR.pm, but I'm not
sure where best to locate that.
4. The test file is under the Par-Packer folder,
t/21-pp_reinstate_cached_files.t. It could perhaps later be merged
into 20-pp.t since it copies large amounts of code from that file. It
needs "perl Makefile.PL; make" to have been run so it can access the
relevant files.
I also included the added files in the manifest under
https://github.com/shawnlaffan/perl_par_reinstate_cache/commit/6cad3a0562aca287834a8b926ce2ed3f9b7552e1,
assuming that was a bug as Roderich suggested.
Regards,
Shawn.
On Thu, Dec 18, 2014 at 10:22 PM, Shawn Laffan
The extractions currently use Archive::Unzip::Burst as the first
attempt. I don't know its behaviour, but I assume that if some
files exist and are locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path
(using Archive::Zip).
In that event the approach switches to iterating over
$zip->memberNames, so it will still work. (Actually,
Archive::Unzip::Burst seems not to install on Windows, so the
latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper
around InfoZip's unzip.
Of course, this won't fix the issues where files are cleaned up
while a PAR process is running,
Obviously we will only re-extract *missing* files, so no problem with
locked files here.
so some sort of API sub would still be useful to allow scripts to
re-extract if file opens fail.
That would imply that you are able to know when opening a file from
"deep inside third party modules" fails - how?
WRT the different cache areas, PAR::_extract_inc currently spends
up to 10 seconds trying to create a lock file, so that line needs
to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because
creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into
play in the contended case.
Adding a check for -w on the target temp location should be
enough to avoid that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by
construction.
Cheers, Roderich
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis
UNSW CRICOS Provider Code 00098G
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse (free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis

UNSW CRICOS Provider Code 00098G
Kime Philip
2015-01-26 14:06:26 UTC
Permalink
Did anyone try this yet? I haven’t had a chance, but I would very much like to get this fixed as it’s the number one bug in biber right now, due to these disappearing cache files.

PK
Post by Shawn Laffan
A candidate patch for this issue is now ready for review.
https://github.com/shawnlaffan/perl_par_reinstate_cache/blob/master/canary.patch
Key points are more or less as per my previous email (below) except that I have added a sub called PAR::get_canary_file_name to avoid any hard coding. I have also decreased the verbosity threshold for packing files into /inc, as previously they needed double verbosity flags in pp to trigger.
Thoughts?
Regards,
Shawn.
I've had a go at implementing the rearguard approach, albeit with the working name PAR_CANARY.txt instead of REARGUARD.txt.
It's still in progress, but if people would like to comment or critique then changes are at https://github.com/shawnlaffan/perl_par_reinstate_cache
1. The approach I have taken is to add a method PAR::Packer::_add_canary_file, but possibly it should simply be appended to the -a file list.
2. Name clashes are unlikely, but perhaps the canary file name needs to be a bit more unique, or maybe it can be located in the script folder instead of the top level of the inc dir.
3. There also needs to be a method to access the canary file name as it is currently hard coded in both PAR::Packer and PAR.pm, but I'm not sure where best to locate that.
4. The test file is under the Par-Packer folder, t/21-pp_reinstate_cached_files.t. It could perhaps later be merged into 20-pp.t since it copies large amounts of code from that file. It needs "perl Makefile.PL; make" to have been run so it can access the relevant files.
I also included the added files in the manifest under https://github.com/shawnlaffan/perl_par_reinstate_cache/commit/6cad3a0562aca287834a8b926ce2ed3f9b7552e1, assuming that was a bug as Roderich suggested.
Regards,
Shawn.
The extractions currently use Archive::Unzip::Burst as the first attempt. I don't know its behaviour, but I assume that if some files exist and are locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path (using Archive::Zip).
In that event the approach switches to iterating over $zip->memberNames, so it will still work. (Actually, Archive::Unzip::Burst seems not to install on Windows, so the latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper around InfoZip's unzip.
Of course, this won't fix the issues where files are cleaned up while a PAR process is running,
Obviously we will only re-extract *missing* files, so no problem with locked files here.
so some sort of API sub would still be useful to allow scripts to re-extract if file opens fail.
That would imply that you are able to know when opening a file from "deep inside third party modules" fails - how?
WRT the different cache areas, PAR::_extract_inc currently spends up to 10 seconds trying to create a lock file, so that line needs to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into play in the contended case.
Adding a check for -w on the target temp location should be enough to avoid that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by construction.
Cheers, Roderich
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse
(free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis
UNSW CRICOS Provider Code 00098G
--
Assoc Prof Shawn Laffan
School of Biological, Earth and Environmental Sciences
UNSW, Sydney 2052, Australia
Tel +61 2 9385 8093 Fax +61 2 9385 1558
http://www.bees.unsw.edu.au/staff/shawn-laffan
http://www.purl.org/biodiverse
(free diversity analysis software)
http://www.tandf.co.uk/journals/ijgis
UNSW CRICOS Provider Code 00098G
--
Dr Philip Kime
Shawn Laffan
2015-01-28 23:13:05 UTC
Permalink
I assume Roderich is busy at the moment, and it's a moderately sized
patch so will take more time to review than the other recent patches.

Roderich also mentioned in another thread that such things should go in
RT so they don't get missed. I'll do so shortly.

Shawn.
Post by Kime Philip
Did anyone try this yet? I haven’t had a chance, but I would very much like to get this fixed as it’s the number one bug in biber right now, due to these disappearing cache files.
PK
Post by Shawn Laffan
A candidate patch for this issue is now ready for review.
https://github.com/shawnlaffan/perl_par_reinstate_cache/blob/master/canary.patch
Key points are more or less as per my previous email (below) except that I have added a sub called PAR::get_canary_file_name to avoid any hard coding. I have also decreased the verbosity threshold for packing files into /inc, as previously they needed double verbosity flags in pp to trigger.
Thoughts?
Regards,
Shawn.
I've had a go at implementing the rearguard approach, albeit with the working name PAR_CANARY.txt instead of REARGUARD.txt.
It's still in progress, but if people would like to comment or critique then changes are at https://github.com/shawnlaffan/perl_par_reinstate_cache
1. The approach I have taken is to add a method PAR::Packer::_add_canary_file, but possibly it should simply be appended to the -a file list.
2. Name clashes are unlikely, but perhaps the canary file name needs to be a bit more unique, or maybe it can be located in the script folder instead of the top level of the inc dir.
3. There also needs to be a method to access the canary file name as it is currently hard coded in both PAR::Packer and PAR.pm, but I'm not sure where best to locate that.
4. The test file is under the Par-Packer folder, t/21-pp_reinstate_cached_files.t. It could perhaps later be merged into 20-pp.t since it copies large amounts of code from that file. It needs "perl Makefile.PL; make" to have been run so it can access the relevant files.
I also included the added files in the manifest under https://github.com/shawnlaffan/perl_par_reinstate_cache/commit/6cad3a0562aca287834a8b926ce2ed3f9b7552e1, assuming that was a bug as Roderich suggested.
Regards,
Shawn.
The extractions currently use Archive::Unzip::Burst as the first attempt. I don't know its behaviour, but I assume that if some files exist and are locked then it will fail.
It's also protected by the $inc.lock "mutex", just like the slow path (using Archive::Zip).
In that event the approach switches to iterating over $zip->memberNames, so it will still work. (Actually, Archive::Unzip::Burst seems not to install on Windows, so the latter approach will be the norm on that OS).
Probably very few people use it. It's just a small Perl wrapper around InfoZip's unzip.
Of course, this won't fix the issues where files are cleaned up while a PAR process is running,
Obviously we will only re-extract *missing* files, so no problem with locked files here.
so some sort of API sub would still be useful to allow scripts to re-extract if file opens fail.
That would imply that you are able to know when opening a file from "deep inside third party modules" fails - how?
WRT the different cache areas, PAR::_extract_inc currently spends up to 10 seconds trying to create a lock file, so that line needs to be modified.
Nope. It's not a lock file, it's a lock *directory*. That's because creation of a directory is the only portable (even network filesystem safe)
filesystem mutex operation. The up to 10 second delay comes only into play in the contended case.
Adding a check for -w on the target temp location should be enough to avoid that when the exe file is in a non-writable directory.
The parent directory of $inc in _extract_inc is writable by construction.
Cheers, Roderich
Shawn Laffan
2014-12-18 20:50:25 UTC
Permalink
In that case I'll just skip inc/lib and inc/script, which is simpler
anyway.

Regards,
Shawn.
On Thu, Dec 18, 2014 at 12:18 AM, Shawn Laffan
It boils down to checking for par_tmp/inc/MANIFEST and unzipping
it if needed. None of the -a packed files are listed in the
manifest,
Ooops, that might be considered a bug...
Cheers, Roderich
Roderich Schupp
2014-12-18 07:38:40 UTC
Permalink
So, my idea of inserting a BEGIN block, in the program.pl to be packaged,
that adds files to %INC should work. Then run pp with the -c (or -x)
option, then those files will be "picked up" by ScanDeps. And the packaged
program.pl would have a list of the files it might need to re-extract.
Don't. %INC is for recording loaded _modules_, don't abuse it that way.

If you are simply looking for a way to record files to be added with -a _in
your script_
(as opposed to remembering them in a separate packing recipe) I suggest a
special

=for pp options...

POD command. This idea was tossed around a few years back but AFAICT nobody
stepped up to implement it.
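For example (purely hypothetical, since nothing implements it, and the file
names are invented), the script itself could carry something like

=for pp -a extras/gui.glade -a extras/app_icon.ico

and pp would pick those options up while scanning the script.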

Cheers, Roderich
Ron W
2014-12-16 21:43:58 UTC
Permalink
---------- Forwarded message ----------
From: Ron W <***@gmail.com>
Date: Tue, Dec 16, 2014 at 10:39 AM
Subject: Re: Problem with cache files being deleted
Post by Shawn Laffan
What I'm thinking of is some way for a PAR executable to detect if extra
files packed using -a are missing, and then unpack them.
I just had a thought for a work-around.

When pp script.pl is run, the script is "statically scanned" for
dependencies.

When pp -c script.pl is run, after the static scan, the script is compiled
(like perl -c script.pl). I am thinking that %INC is then inspected for
additional libraries.

So, my idea is, after identifying all the extra files to be included with
-a, go back to the main .pl file and add a BEGIN block that adds those
library names to %INC. Then run "pp -c main.pl" - without any -a options -
then verify the created executable runs correctly.
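Roughly (untested; the file names are invented for illustration):

# BEGIN block added to the main script so the extra data files are
# "seen" by the dependency scan.  File names are invented.
BEGIN {
    $INC{'Biodiverse_icon.ico'} = 'extras/Biodiverse_icon.ico';
    $INC{'gui.glade'}           = 'extras/gui.glade';
}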

As I said, I see this as a work-around.

A better solution would be for pp to save the names of all the files
included via -a just as it saves the ones it detects.
Kime Philip
2014-12-16 18:21:14 UTC
Permalink
This makes sense - the file that was always mentioned was packed with --addlist, which I assume is the same mechanism as -a.

PK
I had a similar case recently where a user ran CCleaner on the temp folder while the application was open. It deleted everything in the cache folder that was not file locked, so the cache folder still exists, along with several of the component files.
A bit of testing showed the .pm and other perl files are re-extracted the next time the PAR is run, but none of the files packed using -a (an icon file and a glade file in my case).
I ended up adding code to the script to manually check for file existence and, if needed, unpacking from the PAR. It's not a solution that will scale well, but I only have two such files. Example code is below, but PAR::par_handle is probably simpler now I look at the docs again.
For the more general case, is there a simple means to trigger the re-extraction of all files packed using -a?
Regards,
Shawn.
###
require Archive::Zip;
my $icon = Path::Class::file ($ENV{PAR_TEMP}, 'inc', 'Biodiverse_icon.ico');
my $icon_str = $icon->stringify;
my $folder = $icon->dir;
my $fname = $icon->basename;
my $zip = Archive::Zip->new($ENV{PAR_PROGNAME})
or die "Unable to open $ENV{PAR_PROGNAME}";
my $success = $zip->extractMember ( $fname, $icon_str );
Post by Kime Philip
Yes, that’s what I thought, just scheduled /tmp cleanup etc., but in fact it seems to happen more often on non-Unix …
Certainly some cases were due to users Ctrl-Cing a first .exe run because they thought it was taking too long (it is of course normal for it to take longer on the first run of a new version, due to unpacking) and then the cache building was incomplete. However, most reports say that “suddenly” it no longer works and complains about missing files. I can’t really explain it. Deleting the cache and running again always fixes it.
PK
Post by Kime Philip
http://tex.stackexchange.com/questions/140814/biblatex-biber-fails-with-a-strange-error-about-missing-recode-data-xml-file
From what I have seen, when this file is missing, all of the .pm files in inc/ are missing too although I cannot reliably reproduce this. It is a very frequent error from many users on various platforms and I’d very much like to find out why it happens …
I'm fairly confident that PAR::Packer doesn't meddle with the files in the cache area of
a pp-packed executable. AFAICT there's nothing in the code to delete files there (except
to purge the whole cache area immediately after running the executable).
At least for *nix installations I can think of an explanation: administrative cron jobs that clean up
/tmp, /var/tmp etc by purging files not modified for some time. Dunno for Windows, esp.
since the "temp" directories there are typically specific to each user.
Cheers, Roderich
--
Dr Philip Kime
--
Dr Philip Kime
Roderich Schupp
2014-12-16 10:37:14 UTC
Permalink
Post by Kime Philip
Certainly some cases were due to users Ctrl-Cing a first .exe run because
they thought it was taking too long (normal for it to take longer on first
run of a new version of course
Sure - if the next run sees that the cache area (i.e. its top-level
directory) already exists, it assumes
that it is complete and skips the "unpack to cache area" step.

For Windows: Shawn just mentioned CCleaner. Windows itself comes with "Disk
Cleanup" that
will delete "Temporary files" not modified in the last 7 days...

Cheers, Roderich