SocNetV 1.6 "crawl me"

Version 1.6 re-enabled a working and improved web crawler in SocNetV.

Milestone information

Project: SocNetV
Series: 1.x
Version: 1.6
Code name: crawl me
Released: 2015-05-12
Registrant: Dimitris Kalamaras
Release registered: 2015-05-12
Active: No. Drivers cannot target bugs and blueprints to this milestone.

Activities

Assignees: No users assigned to blueprints and bugs.
Blueprints: No blueprints are targeted to this milestone.
Bugs: No bugs are targeted to this milestone.

Download files for this release

After you've downloaded a file, you can verify its authenticity using its MD5 sum or signature.

File                            Description   Downloads
SocNetV-1.6.tar.bz2 (md5, sig)  SocNetV v1.6  21 (last downloaded 44 weeks ago)

Total downloads: 21

Release notes 

The SocNetV project has just released its latest version, 1.6. Binaries for Windows, Mac OS X and Linux are available from the project website (Download menu).

The new version brings back the web crawler feature, which had been disabled throughout the 1.x series until now.

To start the web crawler, go to menu Network > Web Crawler or press Shift+C.

A dialog will appear, where you must enter the initial web page (seed). You may also set the maximum number of nodes, i.e. pages (default 600), and which kinds of links to crawl: internal, external or both. By default the spider will crawl both internal and external links.
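The internal/external distinction can be illustrated with a short Python sketch. It assumes that "internal" means "same host as the seed", which the release notes do not spell out; `classify_link` and `allowed` are hypothetical helper names, not part of SocNetV.

```python
from urllib.parse import urljoin, urlparse


def classify_link(seed_url, href):
    """Resolve href against the seed and label it 'internal' or 'external'.

    Assumption: a link is internal when its host equals the seed's host.
    """
    absolute = urljoin(seed_url, href)
    same_host = urlparse(absolute).netloc == urlparse(seed_url).netloc
    return absolute, ("internal" if same_host else "external")


def allowed(kind, crawl_internal=True, crawl_external=True):
    """Mirror the dialog choice: internal, external or both (the default)."""
    return (kind == "internal" and crawl_internal) or (
        kind == "external" and crawl_external
    )


seed = "http://example.org/index.html"
for href in ["/about.html", "news/today.html", "http://other.example.com/"]:
    url, kind = classify_link(seed, href)
    print(kind, url, allowed(kind))
```

With both flags left at their defaults every link is kept, which corresponds to the default behaviour described above.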

The new web crawler is vastly improved over the 0.x releases and consists of two parts: a 'spider' and a 'parser', each running in its own thread.

The spider visits a given initial URL (i.e. a website or a single webpage) and downloads its HTML code. The parser scans the downloaded code for 'href' links to other pages (internal or external) and adds them to a queue of URLs (called frontier).

As URLs are added to the queue, the spider visits them and downloads their HTML, which the parser scans for more links, and so on.
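The spider/parser cycle can be sketched in Python as a simple frontier loop. This is an illustration of the idea only, not SocNetV's actual (C++/Qt) implementation: the `PAGES` dictionary is a made-up in-memory "web" that stands in for real downloads, and `HrefCollector` and `crawl` are hypothetical names.

```python
from collections import deque
from html.parser import HTMLParser
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.test/": '<body><a href="/b">b</a> <a href="http://c.test/">c</a></body>',
    "http://a.test/b": '<body><a href="/">home</a></body>',
    "http://c.test/": "<body>no links here</body>",
}


class HrefCollector(HTMLParser):
    """Parser step: collect the href attribute of every <a> tag."""

    def __init__(self):
        super().__init__()
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            for name, value in attrs:
                if name == "href" and value:
                    self.hrefs.append(value)


def crawl(seed, max_nodes=600):
    frontier = deque([seed])      # queue of URLs still to visit
    discovered = {seed}           # every URL that became a node
    edges = []                    # (source page, linked page)
    while frontier:
        url = frontier.popleft()
        html = PAGES.get(url, "")             # spider step (stubbed download)
        collector = HrefCollector()
        collector.feed(html)                  # parser step
        for href in collector.hrefs:
            target = urljoin(url, href)
            if target not in discovered:
                if len(discovered) >= max_nodes:
                    continue                  # node limit reached: skip new pages
                discovered.add(target)
                frontier.append(target)
            edges.append((url, target))
    return discovered, edges


nodes, edges = crawl("http://a.test/")
print(sorted(nodes))
print(edges)
```

Replacing the `PAGES.get` lookup with a real HTTP download would turn the stub into an actual spider; the frontier logic stays the same.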

The process is multithreaded and completes in a matter of seconds, even for 1000 URLs.
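One way to picture the two-thread arrangement is a pair of queues: the spider thread moves URLs from the frontier to a "downloaded" queue, and the parser thread feeds newly found URLs back into the frontier. The Python sketch below is an assumption about how such a pipeline could be wired, not a description of SocNetV's internals. Downloads are stubbed with a made-up `PAGES` dictionary, and a simple regular expression stands in for a real HTML parser.

```python
import queue
import re
import threading
from urllib.parse import urljoin

# Hypothetical in-memory "web" so the sketch runs without network access.
PAGES = {
    "http://a.test/": '<body><a href="/b">b</a> <a href="http://c.test/">c</a></body>',
    "http://a.test/b": '<body><a href="/">home</a></body>',
    "http://c.test/": "<body>no links here</body>",
}

HREF_RE = re.compile(r'href="([^"]+)"')  # simplification: double-quoted hrefs only


def crawl_threaded(seed, pages=PAGES, max_nodes=600):
    frontier = queue.Queue()     # URLs waiting to be downloaded
    downloaded = queue.Queue()   # (url, html) pairs waiting to be parsed
    discovered = {seed}          # only touched by the parser thread after start
    edges = []

    def spider():
        while True:
            url = frontier.get()
            if url is None:              # sentinel: pass it on and stop
                downloaded.put(None)
                return
            downloaded.put((url, pages.get(url, "")))

    def parser():
        while True:
            item = downloaded.get()
            if item is None:
                return
            url, html = item
            for href in HREF_RE.findall(html):
                target = urljoin(url, href)
                if target not in discovered:
                    if len(discovered) >= max_nodes:
                        continue
                    discovered.add(target)
                    frontier.put(target)
                edges.append((url, target))
            # Mark the URL as finished only after its links were queued,
            # so frontier.join() cannot return while work is still pending.
            frontier.task_done()

    threads = [threading.Thread(target=spider), threading.Thread(target=parser)]
    for t in threads:
        t.start()
    frontier.put(seed)
    frontier.join()          # blocks until every queued URL has been parsed
    frontier.put(None)       # shut both threads down
    for t in threads:
        t.join()
    return discovered, edges


nodes, edges = crawl_threaded("http://a.test/")
print(len(nodes), "nodes,", len(edges), "edges")
```

The design choice worth noting is that `task_done()` is called by the parser rather than the spider: a URL counts as finished only once its page has been scanned, which gives a clean termination condition for the whole crawl.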

The end result is the 'network' of all visited webpages as nodes and their real links as edges. To help you find some patterns right away, the nodes are by default sized according to their outDegree.
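As a rough illustration of sizing nodes by outDegree, the Python sketch below counts outgoing links per page and maps the counts linearly onto a size range. The linear scaling and the `min_size`/`max_size` values are assumptions made for this example; the release notes do not say how SocNetV computes the displayed sizes.

```python
from collections import Counter


def out_degrees(nodes, edges):
    """Number of distinct outgoing links per node (0 for pages with none)."""
    counts = Counter(src for src, _dst in set(edges))
    return {node: counts.get(node, 0) for node in nodes}


def node_sizes(degrees, min_size=8, max_size=30):
    """Assumed linear mapping from outDegree to a display size."""
    top = max(degrees.values(), default=0)
    if top == 0:
        return {node: min_size for node in degrees}
    span = max_size - min_size
    return {node: min_size + span * deg / top for node, deg in degrees.items()}


nodes = ["A", "B", "C"]
edges = [("A", "B"), ("A", "C"), ("B", "A")]
degrees = out_degrees(nodes, edges)
print(degrees)              # → {'A': 2, 'B': 1, 'C': 0}
print(node_sizes(degrees))  # → {'A': 30.0, 'B': 19.0, 'C': 8.0}
```

Pages that link out to many others stand out immediately, which is the kind of pattern the default display is meant to surface.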

From there, you can analyze the network using the SNA tools provided by SocNetV.

Please note that the parser searches for 'href' links only in the body section of the HTML code.
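The practical effect of this restriction is that 'href' attributes in the head section (for example stylesheet links) do not become nodes. A minimal Python sketch of a body-only scan, using the standard `html.parser` module, could look like this; `BodyHrefParser` is a hypothetical name, and the sketch collects 'href' from any tag inside the body, which is an assumption about what counts as a link.

```python
from html.parser import HTMLParser


class BodyHrefParser(HTMLParser):
    """Collect href values only while inside <body> ... </body>."""

    def __init__(self):
        super().__init__()
        self.in_body = False
        self.hrefs = []

    def handle_starttag(self, tag, attrs):
        if tag == "body":
            self.in_body = True
        elif self.in_body:
            href = dict(attrs).get("href")
            if href:
                self.hrefs.append(href)

    def handle_endtag(self, tag):
        if tag == "body":
            self.in_body = False


html = (
    '<html><head><link href="style.css"></head>'
    '<body><a href="/page">a page</a></body></html>'
)
parser = BodyHrefParser()
parser.feed(html)
print(parser.hrefs)  # → ['/page']
```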

Changelog 

* New feature: Working Webcrawler
  This is the first 1.x release with a working web crawler.
  The crawler consists of two parts: a spider and a parser.
  The spider visits a given initial URL (i.e. a website)
  and downloads its HTML code.
  The parser scans the code for 'href' links to other pages
  (internal or external) and adds them to a queue of URLs
  (called the frontier).
  As URLs are added to the queue, the spider visits them and
  downloads their HTML, which is scanned for more links by the
  parser, and so on...
  The end result is the 'network' of all visited webpages as
  nodes and their real links as edges.
  Please note that the parser searches for 'href' links only
  in the body section of the HTML code.
  To start the web crawler, go to menu Network > Web Crawler
  or press Shift+C. A dialog will appear, where you must
  enter the initial web page (seed).

* Bugfixes:
  #1453743 wrong clustering coefficient calculation
  #1241239 1.x: crawler not working
  #1393926 Web Crawler bug on process full crawling of a
     seed, and bug in different parameters
  #1443965 SocNetV crashes when I try to open saved networks.
  #1388224 Web crawler menu option disabled
