diff -Nru python-parsel-1.5.0/.bumpversion.cfg python-parsel-1.5.2/.bumpversion.cfg --- python-parsel-1.5.0/.bumpversion.cfg 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/.bumpversion.cfg 2019-08-09 11:23:46.000000000 +0000 @@ -1,5 +1,5 @@ [bumpversion] -current_version = 1.5.0 +current_version = 1.5.2 commit = True tag = True tag_name = v{new_version} diff -Nru python-parsel-1.5.0/debian/changelog python-parsel-1.5.2/debian/changelog --- python-parsel-1.5.0/debian/changelog 2019-08-02 18:35:23.000000000 +0000 +++ python-parsel-1.5.2/debian/changelog 2019-08-10 16:11:30.000000000 +0000 @@ -1,3 +1,11 @@ +python-parsel (1.5.2-1) unstable; urgency=medium + + * New upstream version. + * Update the minimum python3-w3lib version to 1.19.0. + * Add Depends: python3-lxml explicitly. + + -- Andrey Rahmatullin Sat, 10 Aug 2019 21:11:30 +0500 + python-parsel (1.5.0-3) unstable; urgency=medium * Drop Python 2 support. diff -Nru python-parsel-1.5.0/debian/control python-parsel-1.5.2/debian/control --- python-parsel-1.5.0/debian/control 2019-08-02 18:35:23.000000000 +0000 +++ python-parsel-1.5.2/debian/control 2019-08-10 16:11:30.000000000 +0000 @@ -15,7 +15,7 @@ python3-pytest-runner, python3-setuptools, python3-six, - python3-w3lib (>= 1.8.0), + python3-w3lib (>= 1.19.0), Standards-Version: 4.4.0 Rules-Requires-Root: no Vcs-Browser: https://salsa.debian.org/python-team/modules/python-parsel @@ -28,6 +28,7 @@ Depends: ${misc:Depends}, ${python3:Depends}, + python3-lxml Description: Python 3 library to extract HTML/XML data using XPath/CSS selectors Parsel is a Python library to extract data from HTML and XML using XPath and CSS selectors diff -Nru python-parsel-1.5.0/docs/usage.rst python-parsel-1.5.2/docs/usage.rst --- python-parsel-1.5.0/docs/usage.rst 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/docs/usage.rst 2019-08-09 11:23:46.000000000 +0000 @@ -119,7 +119,7 @@ 'image4_thumb.jpg', 'image5_thumb.jpg'] -If you want to extract only first matched 
element, you can call the +If you want to extract only the first matched element, you can call the selector ``.get()`` (or its alias ``.extract_first()`` commonly used in previous parsel versions):: @@ -382,7 +382,7 @@ For more details about relative XPaths see the `Location Paths`_ section in the XPath specification. -.. _Location Paths: http://www.w3.org/TR/xpath#location-paths +.. _Location Paths: https://www.w3.org/TR/xpath#location-paths Using EXSLT extensions @@ -530,6 +530,8 @@ .. _regular expressions: http://exslt.org/regexp/index.html .. _set manipulation: http://exslt.org/set/index.html +.. _topics-xpath-other-extensions: + Other XPath extensions ---------------------- @@ -582,7 +584,7 @@ .. _`XPath tutorial`: http://www.zvon.org/comp/r/tut-XPath_1.html -.. _`this post from ScrapingHub's blog`: http://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/ +.. _`this post from ScrapingHub's blog`: https://blog.scrapinghub.com/2014/07/17/xpath-tips-from-the-web-scraping-trenches/ Using text nodes in a condition @@ -624,7 +626,7 @@ >>> sel.xpath("//a[contains(., 'Next Page')]").getall() ['Click here to go to the Next Page'] -.. _`XPath string function`: http://www.w3.org/TR/xpath/#section-String-Functions +.. _`XPath string function`: https://www.w3.org/TR/xpath/#section-String-Functions Beware of the difference between //node[1] and (//node)[1] ~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~~ @@ -703,7 +705,7 @@ are still supported by parsel, there are no plans to deprecate them. However, ``parsel`` usage docs are now written using ``.get()`` and -``.getall()`` methods. We feel that these new methods result in a more concise +``.getall()`` methods. We feel that these new methods result in more concise and readable code. The following examples show how these methods map to each other. @@ -722,7 +724,7 @@ >>> selector.css('a::attr(href)').extract() ['image1.html', 'image2.html', 'image3.html', 'image4.html', 'image5.html'] -2. 
``Selector.get()`` is the same as ``Selector.extract()``:: +3. ``Selector.get()`` is the same as ``Selector.extract()``:: >>> selector.css('a::attr(href)')[0].get() 'image1.html' @@ -734,11 +736,13 @@ >>> selector.css('a::attr(href)')[0].getall() ['image1.html'] -So, the main difference is that output of ``.get()`` and ``.getall()`` methods -is more predictable: ``.get()`` always returns a single result, ``.getall()`` -always returns a list of all extracted results. With ``.extract()`` method -it was not always obvious if a result is a list or not; to get a single -result either ``.extract()`` or ``.extract_first()`` should be called. +With the ``.extract()`` method it was not always obvious if a result is a list +or not; to get a single result either ``.extract()`` or ``.extract_first()`` +needed to be called, depending whether you had a ``Selector`` or ``SelectorList``. + +So, the main difference is that the outputs of ``.get()`` and ``.getall()`` +are more predictable: ``.get()`` always returns a single result, +``.getall()`` always returns a list of all extracted results. .. _topics-selectors-ref: @@ -822,26 +826,31 @@ simple/convenient XPaths. You can use the :meth:`Selector.remove_namespaces` method for that. -Let's show an example that illustrates this with Github blog atom feed. +Let's show an example that illustrates this with the Python Insider blog atom feed. Let's download the atom feed using `requests`_ and create a selector:: >>> import requests >>> from parsel import Selector - >>> text = requests.get('https://github.com/blog.atom').text + >>> text = requests.get('https://feeds.feedburner.com/PythonInsider').text >>> sel = Selector(text=text, type='xml') This is how the file starts:: - - tag:github.com,2008:/blog + ... -You can see two namespace declarations: a default "http://www.w3.org/2005/Atom" -and another one using the "media:" prefix for "http://search.yahoo.com/mrss/". 
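Before the namespace-handling changes below, it may help to see concretely why plain tag names fail to match namespaced XML. This is a minimal stdlib sketch using `xml.etree.ElementTree` rather than parsel, with a made-up two-link Atom-like feed (the URLs are illustrative only):

```python
import xml.etree.ElementTree as ET

# A minimal Atom-like document; the hrefs are made up for illustration.
text = """
<feed xmlns="http://www.w3.org/2005/Atom">
  <link rel="alternate" href="https://example.com/"/>
  <link rel="self" href="https://example.com/feed"/>
</feed>
"""
root = ET.fromstring(text)

# A bare tag name finds nothing: the elements live in the Atom namespace.
print(root.findall("link"))             # []

# Qualifying the tag (Clark notation) matches the namespaced elements,
# which is the same problem parsel's remove_namespaces() sidesteps.
links = root.findall("{http://www.w3.org/2005/Atom}link")
print([el.get("rel") for el in links])  # ['alternate', 'self']
```

This mirrors why `sel.xpath("//link")` returns nothing until `remove_namespaces()` is called, or until a prefix-to-URI mapping is passed to the query.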
+You can see several namespace declarations including a default +"http://www.w3.org/2005/Atom" and another one using the "gd:" prefix for +"http://schemas.google.com/g/2005". We can try selecting all ```` objects and then see that it doesn't work (because the Atom XML namespace is obfuscating those nodes):: @@ -854,8 +863,8 @@ >>> sel.remove_namespaces() >>> sel.xpath("//link") - [, - , + [, + , ... If you wonder why the namespace removal procedure isn't called always by default @@ -881,11 +890,11 @@ references along with the query, through a ``namespaces`` argument, with the prefixes you declare being used in your XPath or CSS query. -Let's use the same Atom feed from Github:: +Let's use the same Python Insider Atom feed:: >>> import requests >>> from parsel import Selector - >>> text = requests.get('https://github.com/blog.atom').text + >>> text = requests.get('https://feeds.feedburner.com/PythonInsider').text >>> sel = Selector(text=text, type='xml') And try to select the links again, now using an "atom:" prefix @@ -898,13 +907,14 @@ You can pass several namespaces (here we're using shorter 1-letter prefixes):: - >>> sel.xpath("//a:entry/m:thumbnail/@url", - ... namespaces={"a": "http://www.w3.org/2005/Atom", - ... "m": "http://search.yahoo.com/mrss/"}).getall() - ['https://avatars1.githubusercontent.com/u/11529908?v=3&s=60', - 'https://avatars0.githubusercontent.com/u/15114852?v=3&s=60', + >>> sel.xpath("//a:entry/a:author/g:image/@src", + ... namespaces={"a": "http://www.w3.org/2005/Atom", + ... "g": "http://schemas.google.com/g/2005"}).getall() + ['http://photos1.blogger.com/blogger/4554/1119/400/beethoven_10.jpg', + '//lh3.googleusercontent.com/-7xisiK0EArc/AAAAAAAAAAI/AAAAAAAAAuM/-r6o6A8RKCM/s512-c/photo.jpg', ... +.. 
_topics-xpath-variables: Variables in XPath expressions ------------------------------ diff -Nru python-parsel-1.5.0/NEWS python-parsel-1.5.2/NEWS --- python-parsel-1.5.0/NEWS 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/NEWS 2019-08-09 11:23:46.000000000 +0000 @@ -3,6 +3,26 @@ History ------- +1.5.2 (2019-08-09) +~~~~~~~~~~~~~~~~~~ + +* ``Selector.remove_namespaces`` received a significant performance improvement +* The value of ``data`` within the printable representation of a selector + (``repr(selector)``) now ends in ``...`` when truncated, to make the + truncation obvious. +* Minor documentation improvements. + + +1.5.1 (2018-10-25) +~~~~~~~~~~~~~~~~~~ + +* ``has-class`` XPath function handles newlines and other separators + in class names properly; +* fixed parsing of HTML documents with null bytes; +* documentation improvements; +* Python 3.7 tests are run on CI; other test improvements. + + 1.5.0 (2018-07-04) ~~~~~~~~~~~~~~~~~~ diff -Nru python-parsel-1.5.0/parsel/__init__.py python-parsel-1.5.2/parsel/__init__.py --- python-parsel-1.5.0/parsel/__init__.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/parsel/__init__.py 2019-08-09 11:23:46.000000000 +0000 @@ -5,7 +5,7 @@ __author__ = 'Scrapy project' __email__ = 'info@scrapy.org' -__version__ = '1.5.0' +__version__ = '1.5.2' from parsel.selector import Selector, SelectorList # NOQA from parsel.csstranslator import css2xpath # NOQA diff -Nru python-parsel-1.5.0/parsel/selector.py python-parsel-1.5.2/parsel/selector.py --- python-parsel-1.5.0/parsel/selector.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/parsel/selector.py 2019-08-09 11:23:46.000000000 +0000 @@ -7,7 +7,7 @@ import six from lxml import etree, html -from .utils import flatten, iflatten, extract_regex +from .utils import flatten, iflatten, extract_regex, shorten from .csstranslator import HTMLTranslator, GenericTranslator @@ -38,7 +38,7 @@ def create_root_node(text, parser_cls, base_url=None): 
"""Create root node for text using given parser class. """ - body = text.strip().encode('utf8') or b'' + body = text.strip().replace('\x00', '').encode('utf8') or b'' parser = parser_cls(recover=True, encoding='utf8') root = etree.fromstring(body, parser=parser, base_url=base_url) if root is None: @@ -258,6 +258,8 @@ In the background, CSS queries are translated into XPath queries using `cssselect`_ library and run ``.xpath()`` method. + + .. _cssselect: https://pypi.python.org/pypi/cssselect/ """ return self.xpath(self._css2xpath(query)) @@ -273,7 +275,7 @@ will be compiled to a regular expression using ``re.compile(regex)``. By default, character entity references are replaced by their - corresponding character (except for ``&`` and ``<``. + corresponding character (except for ``&`` and ``<``). Passing ``replace_entities`` as ``False`` switches off these replacements. """ @@ -286,7 +288,7 @@ the argument is not provided). By default, character entity references are replaced by their - corresponding character (except for ``&`` and ``<``. + corresponding character (except for ``&`` and ``<``). Passing ``replace_entities`` as ``False`` switches off these replacements. 
""" @@ -337,8 +339,8 @@ for an in el.attrib.keys(): if an.startswith('{'): el.attrib[an.split('}', 1)[1]] = el.attrib.pop(an) - # remove namespace declarations - etree.cleanup_namespaces(self.root) + # remove namespace declarations + etree.cleanup_namespaces(self.root) @property def attrib(self): @@ -356,6 +358,6 @@ __nonzero__ = __bool__ def __str__(self): - data = repr(self.get()[:40]) + data = repr(shorten(self.get(), width=40)) return "<%s xpath=%r data=%s>" % (type(self).__name__, self._expr, data) __repr__ = __str__ diff -Nru python-parsel-1.5.0/parsel/utils.py python-parsel-1.5.2/parsel/utils.py --- python-parsel-1.5.0/parsel/utils.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/parsel/utils.py 2019-08-09 11:23:46.000000000 +0000 @@ -80,4 +80,15 @@ strings = flatten(strings) if not replace_entities: return strings - return [w3lib_replace_entities(s, keep=['lt', 'amp']) for s in strings] \ No newline at end of file + return [w3lib_replace_entities(s, keep=['lt', 'amp']) for s in strings] + + +def shorten(text, width, suffix='...'): + """Truncate the given text to fit in the given width.""" + if len(text) <= width: + return text + if width > len(suffix): + return text[:width-len(suffix)] + suffix + if width >= 0: + return suffix[len(suffix)-width:] + raise ValueError('width must be equal or greater than 0') diff -Nru python-parsel-1.5.0/parsel/xpathfuncs.py python-parsel-1.5.2/parsel/xpathfuncs.py --- python-parsel-1.5.0/parsel/xpathfuncs.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/parsel/xpathfuncs.py 2019-08-09 11:23:46.000000000 +0000 @@ -1,7 +1,13 @@ +import re from lxml import etree from six import string_types +from w3lib.html import HTML5_WHITESPACE + +regex = '[{}]+'.format(HTML5_WHITESPACE) +replace_html5_whitespaces = re.compile(regex).sub + def set_xpathfunc(fname, func): """Register a custom extension function to use in XPath expressions. 
@@ -48,6 +54,7 @@ if node_cls is None: return False node_cls = ' ' + node_cls + ' ' + node_cls = replace_html5_whitespaces(' ', node_cls) for cls in classes: if ' ' + cls + ' ' not in node_cls: return False diff -Nru python-parsel-1.5.0/README.rst python-parsel-1.5.2/README.rst --- python-parsel-1.5.0/README.rst 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/README.rst 2019-08-09 11:23:46.000000000 +0000 @@ -35,7 +35,7 @@ """) >>> diff -Nru python-parsel-1.5.0/setup.py python-parsel-1.5.2/setup.py --- python-parsel-1.5.0/setup.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/setup.py 2019-08-09 11:23:46.000000000 +0000 @@ -26,8 +26,9 @@ return parse_version(setuptools_version) >= parse_version('18.5') install_requires = [ - 'w3lib>=1.8.0', - 'lxml>=2.3', + 'w3lib>=1.19.0', + 'lxml;python_version!="3.4"', + 'lxml<=4.3.5;python_version=="3.4"', 'six>=1.5.2', 'cssselect>=0.9' ] @@ -41,7 +42,7 @@ setup( name='parsel', - version='1.5.0', + version='1.5.2', description="Parsel is a library to extract data from HTML and XML using XPath and CSS selectors", long_description=readme + '\n\n' + history, author="Scrapy project", @@ -72,6 +73,7 @@ 'Programming Language :: Python :: 3.4', 'Programming Language :: Python :: 3.5', 'Programming Language :: Python :: 3.6', + 'Programming Language :: Python :: 3.7', 'Programming Language :: Python :: Implementation :: CPython', 'Programming Language :: Python :: Implementation :: PyPy', ], diff -Nru python-parsel-1.5.0/tests/test_selector.py python-parsel-1.5.2/tests/test_selector.py --- python-parsel-1.5.0/tests/test_selector.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/tests/test_selector.py 2019-08-09 11:23:46.000000000 +0000 @@ -114,16 +114,16 @@ """ sel = self.sscls(text=body) - self.assertEquals({'lang': 'en', 'version': '1.0'}, sel.attrib) - self.assertEquals({'id': 'some-list', 'class': 'list-cls'}, sel.css('ul')[0].attrib) + self.assertEqual({'lang': 'en', 'version': '1.0'}, 
sel.attrib) + self.assertEqual({'id': 'some-list', 'class': 'list-cls'}, sel.css('ul')[0].attrib) # for a SelectorList, bring the attributes of first-element only - self.assertEquals({'id': 'some-list', 'class': 'list-cls'}, sel.css('ul').attrib) - self.assertEquals({'class': 'item-cls', 'id': 'list-item-1'}, sel.css('li').attrib) - self.assertEquals({}, sel.css('body').attrib) - self.assertEquals({}, sel.css('non-existing-element').attrib) + self.assertEqual({'id': 'some-list', 'class': 'list-cls'}, sel.css('ul').attrib) + self.assertEqual({'class': 'item-cls', 'id': 'list-item-1'}, sel.css('li').attrib) + self.assertEqual({}, sel.css('body').attrib) + self.assertEqual({}, sel.css('non-existing-element').attrib) - self.assertEquals( + self.assertEqual( [{'class': 'item-cls', 'id': 'list-item-1'}, {'class': 'item-cls active', 'id': 'list-item-2'}, {'class': 'item-cls', 'id': 'list-item-3'}], @@ -133,9 +133,9 @@ body = u"

".format(50 * 'b') sel = self.sscls(text=body) - representation = "".format(40 * 'b') + representation = "".format(37 * 'b') if six.PY2: - representation = "".format(40 * 'b') + representation = "".format(37 * 'b') self.assertEqual( [repr(it) for it in sel.xpath('//input/@name')], @@ -211,17 +211,17 @@ body = u'
<ul><li id="1">1</li><li id="2">2</li></ul>
' sel = self.sscls(text=body) - self.assertEqual(sel.xpath('//ul/li/text()').re_first('\d'), - sel.xpath('//ul/li/text()').re('\d')[0]) + self.assertEqual(sel.xpath('//ul/li/text()').re_first(r'\d'), + sel.xpath('//ul/li/text()').re(r'\d')[0]) - self.assertEqual(sel.xpath('//ul/li[@id="1"]/text()').re_first('\d'), - sel.xpath('//ul/li[@id="1"]/text()').re('\d')[0]) + self.assertEqual(sel.xpath('//ul/li[@id="1"]/text()').re_first(r'\d'), + sel.xpath('//ul/li[@id="1"]/text()').re(r'\d')[0]) - self.assertEqual(sel.xpath('//ul/li[2]/text()').re_first('\d'), - sel.xpath('//ul/li/text()').re('\d')[1]) + self.assertEqual(sel.xpath('//ul/li[2]/text()').re_first(r'\d'), + sel.xpath('//ul/li/text()').re(r'\d')[1]) - self.assertEqual(sel.xpath('/ul/li/text()').re_first('\w+'), None) - self.assertEqual(sel.xpath('/ul/li[@id="doesnt-exist"]/text()').re_first('\d'), None) + self.assertEqual(sel.xpath('/ul/li/text()').re_first(r'\w+'), None) + self.assertEqual(sel.xpath('/ul/li[@id="doesnt-exist"]/text()').re_first(r'\d'), None) self.assertEqual(sel.re_first(r'id="(\d+)'), '1') self.assertEqual(sel.re_first(r'foo'), None) @@ -232,8 +232,8 @@ body = u'
<ul><li id="1">1</li><li id="2">2</li></ul>
' sel = self.sscls(text=body) - self.assertEqual(sel.xpath('//div/text()').re_first('\w+', default='missing'), 'missing') - self.assertEqual(sel.xpath('/ul/li/text()').re_first('\w+', default='missing'), 'missing') + self.assertEqual(sel.xpath('//div/text()').re_first(r'\w+', default='missing'), 'missing') + self.assertEqual(sel.xpath('/ul/li/text()').re_first(r'\w+', default='missing'), 'missing') def test_select_unicode_query(self): body = u"

" @@ -249,8 +249,8 @@ def test_boolean_result(self): body = u"

" xs = self.sscls(text=body) - self.assertEquals(xs.xpath("//input[@name='a']/@name='a'").extract(), [u'1']) - self.assertEquals(xs.xpath("//input[@name='a']/@name='n'").extract(), [u'0']) + self.assertEqual(xs.xpath("//input[@name='a']/@name='a'").extract(), [u'1']) + self.assertEqual(xs.xpath("//input[@name='a']/@name='n'").extract(), [u'0']) def test_differences_parsing_xml_vs_html(self): """Test that XML and HTML Selector's behave differently""" @@ -493,10 +493,10 @@ """ x = self.sscls(text=body) - name_re = re.compile("Name: (\w+)") + name_re = re.compile(r"Name: (\w+)") self.assertEqual(x.xpath("//ul/li").re(name_re), ["John", "Paul"]) - self.assertEqual(x.xpath("//ul/li").re("Age: (\d+)"), + self.assertEqual(x.xpath("//ul/li").re(r"Age: (\d+)"), ["10", "20"]) # Test named group, hit and miss @@ -537,7 +537,7 @@ def test_re_intl(self): body = u'
<div>Evento: cumplea\xf1os</div>
' x = self.sscls(text=body) - self.assertEqual(x.xpath("//div").re("Evento: (\w+)"), [u'cumplea\xf1os']) + self.assertEqual(x.xpath("//div").re(r"Evento: (\w+)"), [u'cumplea\xf1os']) def test_selector_over_text(self): hs = self.sscls(text=u'lala') @@ -568,7 +568,7 @@ \xa3''' x = self.sscls(text=text) - self.assertEquals(x.xpath("//span[@id='blank']/text()").extract(), + self.assertEqual(x.xpath("//span[@id='blank']/text()").extract(), [u'\xa3']) def test_empty_bodies_shouldnt_raise_errors(self): @@ -576,7 +576,7 @@ def test_bodies_with_comments_only(self): sel = self.sscls(text=u'', base_url='http://example.com') - self.assertEquals(u'http://example.com', sel.root.base) + self.assertEqual(u'http://example.com', sel.root.base) def test_null_bytes_shouldnt_raise_errors(self): text = u'pre\x00post' @@ -585,27 +585,27 @@ def test_replacement_char_from_badly_encoded_body(self): # \xe9 alone isn't valid utf8 sequence text = u'

<html><p>an Jos\ufffd de</p></html>

' - self.assertEquals([u'an Jos\ufffd de'], - self.sscls(text).xpath('//text()').extract()) + self.assertEqual([u'an Jos\ufffd de'], + self.sscls(text).xpath('//text()').extract()) def test_select_on_unevaluable_nodes(self): r = self.sscls(text=u'some text') # Text node x1 = r.xpath('//text()') - self.assertEquals(x1.extract(), [u'some text']) - self.assertEquals(x1.xpath('.//b').extract(), []) + self.assertEqual(x1.extract(), [u'some text']) + self.assertEqual(x1.xpath('.//b').extract(), []) # Tag attribute x1 = r.xpath('//span/@class') - self.assertEquals(x1.extract(), [u'big']) - self.assertEquals(x1.xpath('.//text()').extract(), []) + self.assertEqual(x1.extract(), [u'big']) + self.assertEqual(x1.xpath('.//text()').extract(), []) def test_select_on_text_nodes(self): r = self.sscls(text=u'
<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>
') x1 = r.xpath("//div/descendant::text()[preceding-sibling::b[contains(text(), 'Options')]]") - self.assertEquals(x1.extract(), [u'opt1']) + self.assertEqual(x1.extract(), [u'opt1']) x1 = r.xpath("//div/descendant::text()/preceding-sibling::b[contains(text(), 'Options')]") - self.assertEquals(x1.extract(), [u'Options:']) + self.assertEqual(x1.extract(), [u'Options:']) @unittest.skip("Text nodes lost parent node reference in lxml") def test_nested_select_on_text_nodes(self): @@ -613,7 +613,7 @@ r = self.sscls(text=u'
<div><b>Options:</b>opt1</div><div><b>Other</b>opt2</div>
') x1 = r.xpath("//div/descendant::text()") x2 = x1.xpath("./preceding-sibling::b[contains(text(), 'Options')]") - self.assertEquals(x2.extract(), [u'Options:']) + self.assertEqual(x2.extract(), [u'Options:']) def test_weakref_slots(self): """Check that classes are using slots and are weak-referenceable""" @@ -625,28 +625,61 @@ def test_remove_namespaces(self): xml = u""" - - + + + + + """ sel = self.sscls(text=xml, type='xml') self.assertEqual(len(sel.xpath("//link")), 0) self.assertEqual(len(sel.xpath("./namespace::*")), 3) sel.remove_namespaces() + self.assertEqual(len(sel.xpath("//link")), 3) + self.assertEqual(len(sel.xpath("./namespace::*")), 1) + + def test_remove_namespaces_embedded(self): + xml = u""" + + + + + + + + + + + + + + """ + sel = self.sscls(text=xml, type='xml') + self.assertEqual(len(sel.xpath("//link")), 0) + self.assertEqual(len(sel.xpath("//stop")), 0) + self.assertEqual(len(sel.xpath("./namespace::*")), 2) + self.assertEqual(len(sel.xpath("//f:link", namespaces={'f': 'http://www.w3.org/2005/Atom'})), 2) + self.assertEqual(len(sel.xpath("//s:stop", namespaces={'s': 'http://www.w3.org/2000/svg'})), 2) + sel.remove_namespaces() self.assertEqual(len(sel.xpath("//link")), 2) + self.assertEqual(len(sel.xpath("//stop")), 2) self.assertEqual(len(sel.xpath("./namespace::*")), 1) def test_remove_attributes_namespaces(self): xml = u""" - - + + + + + """ sel = self.sscls(text=xml, type='xml') self.assertEqual(len(sel.xpath("//link/@type")), 0) sel.remove_namespaces() - self.assertEqual(len(sel.xpath("//link/@type")), 2) + self.assertEqual(len(sel.xpath("//link/@type")), 3) def test_smart_strings(self): """Lxml smart strings return values""" @@ -692,8 +725,7 @@ def test_configure_base_url(self): sel = self.sscls(text=u'nothing', base_url='http://example.com') - self.assertEquals(u'http://example.com', sel.root.base) - + self.assertEqual(u'http://example.com', sel.root.base) def test_extending_selector(self): class 
MySelectorList(Selector.selectorlist_cls): @@ -708,6 +740,11 @@ self.assertIsInstance(sel.css('div'), MySelectorList) self.assertIsInstance(sel.css('div')[0], MySelector) + def test_replacement_null_char_from_body(self): + text = u'\x00

<html><body><p>Grainy</p></body></html>

' + self.assertEqual(u'

<html><body><p>Grainy</p></body></html>

', + self.sscls(text).extract()) + class ExsltTestCase(unittest.TestCase): sscls = Selector @@ -732,7 +769,7 @@ self.assertEqual( [x.extract() for x in sel.xpath( - '//a[re:test(@href, "\.html$")]/text()')], + r'//a[re:test(@href, "\.html$")]/text()')], [u'first link', u'second link']) self.assertEqual( [x.extract() @@ -753,20 +790,18 @@ #u'', #u'/xml/index.xml?/xml/utils/rechecker.xml'] self.assertEqual( - sel.xpath('re:match(//a[re:test(@href, "\.xml$")]/@href,' - '"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)")/text()').extract(), + sel.xpath(r're:match(//a[re:test(@href, "\.xml$")]/@href,' + r'"(\w+):\/\/([^/:]+)(:\d*)?([^# ]*)")/text()').extract(), [u'http://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.xml', u'http', u'www.bayes.co.uk', u'', u'/xml/index.xml?/xml/utils/rechecker.xml']) - - # re:replace() self.assertEqual( - sel.xpath('re:replace(//a[re:test(@href, "\.xml$")]/@href,' - '"(\w+)://(.+)(\.xml)", "","https://\\2.html")').extract(), + sel.xpath(r're:replace(//a[re:test(@href, "\.xml$")]/@href,' + r'"(\w+)://(.+)(\.xml)", "","https://\2.html")').extract(), [u'https://www.bayes.co.uk/xml/index.xml?/xml/utils/rechecker.html']) def test_set(self): diff -Nru python-parsel-1.5.0/tests/test_utils.py python-parsel-1.5.2/tests/test_utils.py --- python-parsel-1.5.0/tests/test_utils.py 1970-01-01 00:00:00.000000000 +0000 +++ python-parsel-1.5.2/tests/test_utils.py 2019-08-09 11:23:46.000000000 +0000 @@ -0,0 +1,26 @@ +from parsel.utils import shorten + +from pytest import mark, raises +import six + + +@mark.parametrize( + 'width,expected', + ( + (-1, ValueError), + (0, u''), + (1, u'.'), + (2, u'..'), + (3, u'...'), + (4, u'f...'), + (5, u'fo...'), + (6, u'foobar'), + (7, u'foobar'), + ) +) +def test_shorten(width, expected): + if isinstance(expected, six.string_types): + assert shorten(u'foobar', width) == expected + else: + with raises(expected): + shorten(u'foobar', width) diff -Nru python-parsel-1.5.0/tests/test_xpathfuncs.py 
python-parsel-1.5.2/tests/test_xpathfuncs.py --- python-parsel-1.5.0/tests/test_xpathfuncs.py 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/tests/test_xpathfuncs.py 2019-08-09 11:23:46.000000000 +0000 @@ -72,6 +72,25 @@ [x.extract() for x in sel.xpath('//p[has-class("foo")]/text()')], [u'First']) + def test_has_class_newline(self): + body = u""" +

<p class="foo
bar">First</p>

+ """ + sel = Selector(text=body) + self.assertEqual( + [x.extract() for x in sel.xpath(u'//p[has-class("foo")]/text()')], + [u'First']) + + def test_has_class_tab(self): + body = u""" +

<p class="foo	bar">First</p>

+ """ + sel = Selector(text=body) + self.assertEqual( + [x.extract() for x in sel.xpath(u'//p[has-class("foo")]/text()')], + [u'First']) + def test_set_xpathfunc(self): def myfunc(ctx): diff -Nru python-parsel-1.5.0/tox.ini python-parsel-1.5.2/tox.ini --- python-parsel-1.5.0/tox.ini 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/tox.ini 2019-08-09 11:23:46.000000000 +0000 @@ -1,5 +1,5 @@ [tox] -envlist = py27, py34, py35, py36, pypy, pypy3 +envlist = py27, py34, py35, py36, py37, pypy, pypy3 [testenv] deps = diff -Nru python-parsel-1.5.0/.travis.yml python-parsel-1.5.2/.travis.yml --- python-parsel-1.5.0/.travis.yml 2018-07-03 21:19:19.000000000 +0000 +++ python-parsel-1.5.2/.travis.yml 2019-08-09 11:23:46.000000000 +0000 @@ -18,17 +18,21 @@ env: TOXENV=py35 - python: 3.6 env: TOXENV=py36 + - python: 3.7 + env: TOXENV=py37 + dist: xenial + sudo: true install: - | if [ "$TOXENV" = "pypy" ]; then - export PYPY_VERSION="pypy-5.9-linux_x86_64-portable" + export PYPY_VERSION="pypy-6.0.0-linux_x86_64-portable" wget "https://bitbucket.org/squeaky/portable-pypy/downloads/${PYPY_VERSION}.tar.bz2" tar -jxf ${PYPY_VERSION}.tar.bz2 virtualenv --python="$PYPY_VERSION/bin/pypy" "$HOME/virtualenvs/$PYPY_VERSION" source "$HOME/virtualenvs/$PYPY_VERSION/bin/activate" fi if [ "$TOXENV" = "pypy3" ]; then - export PYPY_VERSION="pypy3.5-5.9-beta-linux_x86_64-portable" + export PYPY_VERSION="pypy3.5-6.0.0-linux_x86_64-portable" wget "https://bitbucket.org/squeaky/portable-pypy/downloads/${PYPY_VERSION}.tar.bz2" tar -jxf ${PYPY_VERSION}.tar.bz2 virtualenv --python="$PYPY_VERSION/bin/pypy3" "$HOME/virtualenvs/$PYPY_VERSION"