Extendable web checker
This project is maintained by eghuro
Crawlcheck is a web crawler invoking plugins on received content. It is intended for verification of websites prior to deployment. The verification process is customisable via a configuration file that specifies in detail which plugins should check particular URIs and content types.
The latest released version is 1.1, from Jul 5, 2018.
This project started as a software project at Charles University in Prague. There is also an undergraduate thesis about it (in Czech), defended there in September 2017, and slides from the defense (also in Czech) are available.
Crawlcheck's engine currently runs on Python 3.5 or 3.6 and uses SQLite3 as the database backend. Crawlcheck relies on a number of open-source projects to work properly; the Python dependencies are listed in requirements.txt.
For the web report there is a separate project.
0) You will need python3, python-pip, sqlite3, virtualenv, libmagic, libtidy, libxml2 and libxslt installed (the dev or devel versions of the libraries).
1) Fetch sources
git clone https://github.com/eghuro/crawlcheck crawlcheck
2) Install the Python dependencies
cd crawlcheck
pip install -r requirements.txt
The configuration file is a YAML file defined as follows:
---
version: 1.6 # configuration format version
database: crawlcheck.sqlite # sqlite database file
maxDepth: 10 # max number of links followed from any entry point (default: 0, meaning unlimited)
agent: "Crawlcheck/1.1" # user agent used (default: Crawlcheck/1.1)
logfile: cc.log # where to store logs
maxContentLength: 2000000 # max file size to download
pluginDir: plugin # where to look for plugins (including subfolders, default: 'plugin')
timeout: 1 # timeout for networking (default: 1)
cleandb: True # clean database before execution
initdb: True # initialize database
report: "http://localhost:5000" # report REST API
cleanreport: True # clean entries in report before sending current
maxVolume: 100000000 # max 100 MB of tmp files (default: sys.maxsize)
maxAttempts: 2 # attempts to download a web page (default: 3)
tmpPrefix: "Crawlcheck" # prefix for temporary file names with downloaded content (default: Crawlcheck)
tmpSuffix: "content" # suffix for temporary file names with downloaded content (default: content)
tmpDir: "/tmp/" # where to store temporary files (default: /tmp/)
dbCacheLimit: 100000 # number of cached database queries (default: sys.maxsize)
urlLimit: 10000000 # limit on seen URIs
verifyHttps: True # verify HTTPS? (default: False)
cores: 2 # number of cores available (e.g. for parallel report payload generation)
recordParams: False # record request data or URL params? (default: True)
recordHeaders: False # record response headers? (default: True)
sitemap-file: "sitemap.xml" # where to store generated sitemap.xml
sitemap-regex: "https?://ksp.mff.cuni.cz(/.*)?" # regex for sitemap generator
yaml-out-file: "cc.yml" # where to write YAML report
report-file: "report" # where to write PDF report (.pdf will be added automatically)
# other parameters used by plugins, written as key: value
urls:
-
  url: "http://mj.ucw.cz/vyuka/.+"
  plugins: # which plugins are allowed for given URL
  - linksFinder
  - tidyHtmlValidator
  - tinycss
  - css_scraper
  - formChecker
  - seoimg
  - seometa
  - dupdeteict
  - non_semantic_html
-
  url: "http://mj.ucw.cz/" # test links (HEAD request) only
  plugins:
filters: # Filters (plugins of category header and filter) that can be used
- depth
- robots
- contentLength
- canonical
- acceptedType
- acceptedUri
- uri_normalizer
- expectedType
# filters: True
# alternative option to allow all available filters
# can be passed on command line using --param
postprocess:
- sitemap_generator
- report_exporter
- yaml_exporter
# postprocess: True
# alternative option to allow all available postprocessors
# can be passed on command line using --param
entryPoints: # where to start
# Note that once a URI gets into the database it is no longer requested
# (beware of repeated runs: if an entry point remains in the database, execution won't
# start from that entry point)
- "http://mj.ucw.cz/vyuka/"
Assuming you have gone through set-up and configuration, run the checker:
$ cd [root]/crawlcheck/src/
$ python checker/ [config.yml]
Note: [root]/crawlcheck is where the repository was cloned to, and [config.yml] stands for the path to the configuration file.
There are currently 5 types of plugins: crawlers, checkers, headers, filters and postprocessors. Crawlers specialize in discovering new links. Checkers check the syntax of various files. Headers check HTTP headers and, together with filters, serve to customize the crawling process itself. Postprocessors generate reports or other outputs from the application.
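Each plugin declares its category through the category attribute (see the plugin skeleton below). As a rough sketch of what the PluginType enumeration in common might look like (only CHECKER appears in this documentation; the other member names are assumptions):

from enum import Enum

class PluginType(Enum):
    # Hypothetical member names: only CHECKER is confirmed by the skeleton
    # below; consult common in the engine sources for the real definition.
    CRAWLER = 1
    CHECKER = 2
    HEADER = 3
    FILTER = 4
    POSTPROCESSOR = 5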
Crawlcheck currently ships with a number of plugins of each type; the configuration example above references them by id.
Go to crawlcheck/src/checker/plugin/, create my_new_plugin.py and my_new_plugin.yapsy-plugin files there.
Fill out the .yapsy-plugin file:
[Core]
Name = Human readable plugin name
Module = my_new_plugin
[Documentation]
Author = Your Name
Version = 0.0
Description = My New Plugin
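Crawlcheck discovers plugins through yapsy. To see which plugins yapsy would pick up from the plugin directory, a quick sketch follows; it assumes it is run from crawlcheck/src/checker/ so that the plugin folder matches the pluginDir default and the plugins' own imports resolve:

from yapsy.PluginManager import PluginManager

# Collect plugins from the 'plugin' directory (the pluginDir default above).
manager = PluginManager()
manager.setPluginPlaces(["plugin"])
manager.collectPlugins()

for info in manager.getAllPlugins():
    print(info.name, "->", type(info.plugin_object).__name__)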
For the plugin itself you need to implement the following:
from yapsy.IPlugin import IPlugin
from common import PluginType
from filter import FilterException # for headers and filters
class MyPlugin(IPlugin):
    category = PluginType.CHECKER  # pick the appropriate type
    id = "myPlugin"
    contentTypes = ["text/html"]  # accepted content types (checkers & crawlers)

    def acceptType(self, ctype):
        # alternatively, a method resolving more complex content-type rules
        return True

    def setJournal(self, journal):
        # record the journal somewhere - all categories
        self.journal = journal

    def setQueue(self, queue):
        # record the queue somewhere - if needed
        self.queue = queue

    def setConf(self, conf):
        # record the configuration - only headers and filters
        self.conf = conf

    def check(self, transaction):
        # implement the checking logic here for crawlers and checkers
        pass

    def filter(self, transaction):
        # implement the filtering logic here for filters and headers
        # raise FilterException to filter the transaction out
        pass

    def setDb(self, db):
        # record the DB somewhere - only postprocessors
        self.db = db

    def process(self):
        # implement the postprocessing logic here for postprocessors
        pass
See http://yapsy.sourceforge.net/IPlugin.html and http://yapsy.sourceforge.net/PluginManager.html#plugin-info-file-format for more details.
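As a worked example, a minimal filter plugin might look roughly like the sketch below. It sticks to what the skeleton above shows; the transaction attribute name (uri) and the FILTER category member are assumptions, so check the engine sources before relying on them:

import re

from yapsy.IPlugin import IPlugin
from common import PluginType
from filter import FilterException


class PdfSkipper(IPlugin):
    """Hypothetical filter that drops URIs pointing at PDF files."""

    category = PluginType.FILTER  # assumed member name, see note above
    id = "pdfSkipper"

    def setJournal(self, journal):
        # keep a reference to the journal in case the filter wants to log
        self.journal = journal

    def filter(self, transaction):
        # 'transaction.uri' is assumed here; adjust to the real attribute name
        if re.search(r"\.pdf$", transaction.uri):
            raise FilterException()

A matching pdf_skipper.yapsy-plugin file would sit next to it, and the plugin would be enabled in the configuration by listing its id under filters.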
MIT
Copyright (c) 2015-2018 Alexandr Mansurov
Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:
The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.
THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.