- Sources: yweb/crawlrank/parser/src/filter.[h,cpp]
- Test: yweb/crawlrank/parser/filtertest/filtertest.cpp
- Yandex main filter sources: yweb/robot/filter/*
Results
Items from the Yandex feeds were combined, sorted, and deduplicated.
Input file: Feeds
Output file: Result (two tab-separated columns: url, filter result)
Total items: 600330
Spam filter results:
URL_BAD_NORMALIZEURL = 1380
URL_CHECK_GARBAGE = 451
URL_CHECK_MIRROR = 46810
URL_CHECK_REGEXP = 10546
UrlFilterBadFormat = 1821
UrlFilterBadScheme = 6936
UrlFilterDomain = 22835
UrlFilterEmpty = 11
UrlFilterExtDomain = 609
UrlFilterHost = 42
UrlFilterMax = 1
UrlFilterOpaque = 21
UrlFilterSuffix = 1543
UrlFilterUrl = 33
Good = 508671
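The per-code counts above can be recomputed directly from the output file. The following is a minimal sketch (not the actual filtertest code) that tallies the second column of the tab-separated result file; it assumes the format described under "Output file" and takes the file name as a command-line argument.

// Illustrative only: aggregates "url<TAB>filter result" lines into per-result counts.
#include <fstream>
#include <iostream>
#include <map>
#include <string>

int main(int argc, char** argv) {
    if (argc < 2) {
        std::cerr << "usage: " << argv[0] << " <result-file>\n";
        return 1;
    }
    std::ifstream in(argv[1]);
    std::map<std::string, size_t> counts;
    size_t total = 0;
    std::string line;
    while (std::getline(in, line)) {
        size_t tab = line.find('\t');
        if (tab == std::string::npos)
            continue;                    // skip malformed lines
        ++counts[line.substr(tab + 1)];  // second column: filter result code
        ++total;
    }
    std::cout << "Total items: " << total << "\n";
    for (const auto& kv : counts)
        std::cout << kv.first << " = " << kv.second << "\n";
    return 0;
}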
Interpretation of results:
The result codes above are described in yweb/robot/mergelib/filterstat.h:
PRINT_STAT(UrlFilterOK, "good urls");
PRINT_STAT(UrlFilterEmpty, "empty");
PRINT_STAT(UrlFilterBadScheme, "bad scheme");
PRINT_STAT(UrlFilterBadFormat, "bad format");
PRINT_STAT(UrlFilterBadPort, "bad port");
PRINT_STAT(UrlFilterBadPath, "bad path");
PRINT_STAT(UrlFilterOpaque, "bad scheme");
PRINT_STAT(UrlFilterBadAuth, "bad auth");
PRINT_STAT(UrlFilterTooLong, "too long");
PRINT_STAT(UrlFilterHost, "on hosts disabled - filter");
PRINT_STAT(UrlFilterUrl, "disabled - filter");
PRINT_STAT(UrlFilterSuffix, "has bad suffix");
PRINT_STAT(UrlFilterDomain, "on domain disabled - filter");
PRINT_STAT(UrlFilterExtDomain, "on domain not allowed - filter");
PRINT_STAT(UrlFilterPort, "on port disabled - filter");
PRINT_STAT(URL_CHECK_NOTABS, "not absolute");
PRINT_STAT(URL_CHECK_NOTHTTP, "scheme not http");
PRINT_STAT(URL_CHECK_LONGHOST, "netloc too long");
PRINT_STAT(URL_CHECK_MIRROR, "on mirrors");
PRINT_STAT(URL_CHECK_DEEPDIR, "path too deep");
PRINT_STAT(URL_CHECK_DUPDIRS, "too many repetitions in path");
PRINT_STAT(URL_CHECK_REGEXP, "disabled - regexp");
PRINT_STAT(URL_CHECK_EXT_REGEXP, "disabled - ext regexp");
PRINT_STAT(URL_CHECK_HOPS, "disabled - hops");
PRINT_STAT(URL_CHECK_CLEAN_PARAM, "disabled - clean-param");
PRINT_STAT(URL_CHECK_ROBOTS, "disabled - robots.txt");
PRINT_STAT(URL_CHECK_OLDDELETED, "too old deleted");
PRINT_STAT(URL_CHECK_PENALTY, "on dead hosts");
PRINT_STAT(URL_CHECK_FOREIGN, "on foreign hosts");
PRINT_STAT(URL_CHECK_MANUAL, "deleted manualy");
PRINT_STAT(URL_CHECK_POLICY, "deleted - policy");
PRINT_STAT(URL_CHECK_BIGHOPS, "too big hops deleted");
PRINT_STAT(URL_CHECK_FAST_FORCE_CRAWL, "url injected into fast robot with flag FORCECRAWL, ignore any filtering");
PRINT_STAT(URL_CHECK_POLICY_MIME_TYPE, "deleted - policy mime type");
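These lines map each counter to the human-readable description used in the report. The sketch below is an assumption about the general shape of a PRINT_STAT-style macro, not the actual definition from filterstat.h; the struct and field names are illustrative.

// Illustrative only: a plausible shape for a PRINT_STAT-style macro.
#include <cstdio>

struct TFilterStat {
    unsigned UrlFilterOK = 0;
    unsigned UrlFilterBadScheme = 0;
    // ... one counter per result code ...
};

// Prints the counter name, its value, and the human-readable description.
#define PRINT_STAT(field, description) \
    std::printf("%-28s %10u  %s\n", #field, stat.field, description)

void PrintStats(const TFilterStat& stat) {
    PRINT_STAT(UrlFilterOK, "good urls");
    PRINT_STAT(UrlFilterBadScheme, "bad scheme");
}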
Interesting facts
- All URLs whose scheme is not http (for example, https) are rejected as UrlFilterBadScheme; see the illustrative sketch after this list.
- In most cases, URLs rejected by NormalizeUrl (used by orange) are also rejected by the filter parser. However, the filter may reject additional URLs based on other factors such as the URL scheme (http or not), the host name, the hostname length, etc.
- UrlFilterBadFormat covers (but is not limited to) the following rules: (a) the host length exceeds HOST_MAX; (b) the host name does not contain the character '.'.
- URL_CHECK_GARBAGE: the URL must contain only alphanumeric characters and the following special characters: !@\"#$%^:&?*()-_+=[]{}|\\/';><?.,
- The remaining rules are dictionary-based and are specified in the corresponding filter files.
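As referenced above, here is an illustrative sketch of how the scheme, bad-format, and garbage checks could look. It is not the actual yweb/robot/filter implementation; the HOST_MAX value and the host-extraction logic are assumptions made for the example.

// Illustrative only: approximates the checks described in the list above.
#include <cctype>
#include <cstring>
#include <string>

enum EUrlCheck { Good, UrlFilterBadScheme, UrlFilterBadFormat, URL_CHECK_GARBAGE };

static const size_t HOST_MAX = 255;  // assumed limit, for illustration only

EUrlCheck CheckUrl(const std::string& url) {
    // Only plain http is accepted; https and other schemes are rejected.
    if (url.compare(0, 7, "http://") != 0)
        return UrlFilterBadScheme;

    // Host part: everything between "http://" and the next '/'.
    size_t hostEnd = url.find('/', 7);
    std::string host = (hostEnd == std::string::npos) ? url.substr(7)
                                                      : url.substr(7, hostEnd - 7);
    if (host.size() > HOST_MAX || host.find('.') == std::string::npos)
        return UrlFilterBadFormat;

    // Garbage check: only alphanumerics and the listed special characters.
    static const char* allowed = "!@\"#$%^:&?*()-_+=[]{}|\\/';><?.,";
    for (char c : url) {
        if (!std::isalnum(static_cast<unsigned char>(c)) && std::strchr(allowed, c) == nullptr)
            return URL_CHECK_GARBAGE;
    }
    return Good;
}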