Stephan Kulow
2011-12-20 13:40:27 +00:00
committed by Git OBS Bridge
parent 3d4e564a24
commit 80be1dc906
2 changed files with 13 additions and 43 deletions

@@ -20,18 +20,21 @@ Name: perl-HTML-TableExtract
Version: 2.11
Release: 0
%define cpan_name HTML-TableExtract
Summary: For extracting the content contained in tables within an HTML document
Summary: Perl module for extracting the content contained in tables within an HTM[cut]
License: GPL-1.0+ or Artistic-1.0
Group: Development/Libraries/Perl
Url: http://search.cpan.org/dist/HTML-TableExtract/
Source: http://www.cpan.org/authors/id/M/MS/MSISK/HTML-TableExtract-%{version}.tar.gz
Patch0: %{cpan_name}-2.10-HTML.patch
Source: http://www.cpan.org/authors/id/M/MS/MSISK/%{cpan_name}-%{version}.tar.gz
Patch0: HTML-TableExtract-2.10-HTML.patch
BuildArch: noarch
BuildRoot: %{_tmppath}/%{name}-%{version}-build
BuildRequires: perl
BuildRequires: perl-macros
BuildRequires: perl(HTML::ElementTable) >= 1.16
BuildRequires: perl(HTML::Parser)
#BuildRequires: perl(HTML::Entities)
#BuildRequires: perl(HTML::TableExtract)
#BuildRequires: perl(testload)
Requires: perl(HTML::ElementTable) >= 1.16
Requires: perl(HTML::Parser)
%{perl_requires}
@@ -94,45 +97,10 @@ When extracting only text from tables, the text is decoded with
HTML::Entities by default; this can be disabled by setting the _decode_
parameter to 0.
Extraction Modes
The default mode of extraction for HTML::TableExtract is raw text or
HTML. In this mode, embedded tables are completely decoupled from one
another. In this case, HTML::TableExtract is a subclass of
HTML::Parser:
use HTML::TableExtract;
Alternatively, tables can be extracted as HTML::ElementTable
structures, which are in turn embedded in an HTML::Element tree
representing the entire HTML document. Embedded tables are not
decoupled from one another since this tree structure must be
maintained. In this case, HTML::TableExtract is a subclass of
HTML::TreeBuilder (itself a subclass of HTML::Parser):
use HTML::TableExtract qw(tree);
In either case, the basic interface for HTML::TableExtract and the
resulting table objects remains the same -- all that changes is what
you can do with the resulting data.
HTML::TableExtract is a subclass of HTML::Parser, and as such inherits
all of its basic methods such as 'parse()' and 'parse_file()'. During
scans, 'start()', 'end()', and 'text()' are utilized. Feel free to
override them, but if you do not eventually invoke them in the SUPER
class with some content, results are not guaranteed.
Advice
The main point of this module was to provide a flexible method of
extracting tabular information from HTML documents without relying too
heavily on the document layout. For that reason, I suggest using
_Headers_ whenever possible -- that way, you are anchoring your
extraction on what the document is trying to communicate rather than
some feature of the HTML comprising the document (other than the fact
that the data is contained in a table).
%prep
%setup -q -n %{cpan_name}-%{version}
%patch0 -p1
find . -type f -print0 | xargs -0 chmod 644
%build
%{__perl} Makefile.PL INSTALLDIRS=vendor
@@ -146,11 +114,8 @@ Advice
%perl_process_packlist
%perl_gen_filelist
%clean
%{__rm} -rf %{buildroot}
%files -f %{name}.files
%defattr(644,root,root,755)
%defattr(-,root,root,755)
%doc Changes README
%changelog
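
The %prep section in the hunk above runs `find . -type f -print0 | xargs -0 chmod 644` on the unpacked sources. A minimal sketch of what that pipeline does, run against a scratch directory rather than a real build tree, might look like the following. The directory layout, the file names, and the `normalize_modes` helper are made up for illustration and are not part of the spec file. Reading this step as a companion to the `%defattr(-,root,root,755)` line is an assumption; the commit itself does not say so.

```shell
#!/bin/sh
# Sketch only: reproduce the %prep mode normalization on a throwaway tree.
set -eu

# Hypothetical helper wrapping the pipeline from the spec file.
# -print0 / -0 keep file names with spaces or newlines intact.
normalize_modes() {
    find "$1" -type f -print0 | xargs -0 chmod 644
}

# Build a scratch tree with deliberately inconsistent file modes.
workdir=$(mktemp -d)
mkdir -p "$workdir/lib/HTML" "$workdir/t"
: > "$workdir/lib/HTML/TableExtract.pm"
: > "$workdir/t/00_basic.t"
: > "$workdir/README file"
chmod 755 "$workdir/lib/HTML/TableExtract.pm"
chmod 600 "$workdir/t/00_basic.t"
chmod 444 "$workdir/README file"

normalize_modes "$workdir"

# Count regular files whose mode is still something other than 0644.
# Directories are untouched because of -type f.
remaining=$(find "$workdir" -type f ! -perm 644 | wc -l | tr -d ' ')
echo "files not at 0644: $remaining"

rm -rf "$workdir"
```

If the modes are fixed in the source tree this way, the installed files already carry 0644, and a `-` in the first `%defattr` field then keeps those on-disk modes instead of forcing one value at packaging time.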