Compare commits
52 Commits
Author | SHA1 | Date |
---|---|---|
Alexander Andreev | 43909c2b29 | |
Alexander Andreev | acbfaefa9c | |
Alexander Andreev | 86ef44aa07 | |
Alexander Andreev | 419fb2b673 | |
Alexander Andreev | 0287d3a132 | |
Alexander Andreev | 245e33f40d | |
Alexander Andreev | e092c905b2 | |
Alexander Andreev | 90338073ed | |
Alexander Andreev | cdcc184de8 | |
Alexander Andreev | b335891097 | |
Alexander Andreev | 1213cef776 | |
Alexander Andreev | 78d4a62c17 | |
Alexander Andreev | f3ef07af68 | |
Alexander Andreev | 6373518dc3 | |
Alexander Andreev | caf18a1bf0 | |
Alexander Andreev | 751549f575 | |
Alexander Andreev | 38b5740d73 | |
Alexander Andreev | 2f9d26427c | |
Alexander Andreev | e7cf2e7c4b | |
Alexander Andreev | 4f6f56ae7b | |
Alexander Andreev | 503eb9959b | |
Alexander Andreev | cb2e0d77f7 | |
Alexander Andreev | 93e442939a | |
Alexander Andreev | 6022c9929a | |
Alexander Andreev | f79abcc310 | |
Alexander Andreev | 9cdb510325 | |
Alexander Andreev | 986fdbe7a7 | |
Alexander Andreev | 2e6352cb13 | |
Alexander Andreev | 7b2fcf0899 | |
Alexander Andreev | 21837c5335 | |
Alexander Andreev | b970973018 | |
Alexander Andreev | 6dab626084 | |
Alexander Andreev | 86b6278657 | |
Alexander Andreev | 7754a90313 | |
Alexander Andreev | bb47b50c5f | |
Alexander Andreev | 8403fcf0f2 | |
Alexander Andreev | 647a787974 | |
Alexander Andreev | 6a54b88498 | |
Alexander Andreev | 2043fc277f | |
Alexander Andreev | a106d5b739 | |
Alexander Andreev | 7825b53121 | |
Alexander Andreev | b26152f3ca | |
Alexander Andreev | 9ad9fcfd6f | |
Alexander Andreev | 2fcd4f0aa7 | |
Alexander Andreev | bfaa9d2778 | |
Alexander Andreev | 371c6623e9 | |
Alexander Andreev | 520d88c76a | |
Alexander Andreev | 93d2904a4f | |
Alexander Andreev | 6df9e573aa | |
Alexander Andreev | f21ff0aff5 | |
Alexander Andreev | c0282f3934 | |
Alexander Andreev | 4db2e1dc75 | |
87 CHANGELOG.md

@@ -1,5 +1,92 @@
 # Changelog
 
+## 0.5.1 - 2021-05-04
+
+### Added
+
+- Message when a file cannot be retrieved.
+
+### Fixed
+
+- Removed excessive hash comparison when files have the same name;
+- A string was not made an f-string, so now it displays the reason why a
+  thread wasn't found.
+
+## 0.5.0 - 2021-05-03
+
+### Added
+
+- Now the program makes use of the skip_posts argument. Use CLI option
+  `-S <number>` or `--skip-posts <number>` to set how many posts you want
+  to skip.
+
+### Changed
+
+- Better, minified messages;
+- Fixed inheritance of `Scraper`'s subclasses and gave them a sane rewrite
+  that allows easy future extension with far less repetition;
+- Added a general class `TinyboardLikeParser` that implements the post parser
+  for all imageboards based on Tinyboard or ones that have an identical JSON
+  API. From now on all such generalisation classes will end with `*LikeParser`;
+- Changed `file_base_url` for 8kun.top.
+
+### Removed
+
+- Support for Lolifox, since it's gone.
+
+## 0.4.1 - 2020-12-08
+
+### Fixed
+
+- Now HTTPException from http.client and URLError from urllib.request
+  are handled;
+- 2ch.hk's stickers handling.
+
+## 0.4.0 - 2020-11-18
+
+### Added
+
+- For 2ch.hk, a check for whether a file is a sticker was added;
+- Encoding for the `!op.txt` file was explicitly set to `utf-8`;
+- Handling of connection errors was added, so now the program won't crash if
+  a file doesn't exist or isn't accessible for any other reason, and any
+  damaged files that were created will be removed;
+- Added 3 retries if a file was damaged during downloading;
+- The scraper now matches hashes of two files that happen to share the same
+  name and size but whose hash reported by an imageboard differs from the
+  file's. This results in excessive downloading and hash calculations;
+  hopefully that's only the case for 2ch.hk.
+
+### Changed
+
+- The FileInfo class is now a frozen dataclass for memory efficiency.
+
+### Fixed
+
+- Arguments for the match function that matches the `image.ext` pattern were
+  mixed up in places all over the parsers;
+- Also for 2ch.hk, the checks for `sub` and `com` were changed to `subject`
+  and `comment`.
+
+## 0.3.0 - 2020-09-09
+
+### Added
+
+- Parser for lolifox.cc.
+
+### Removed
+
+- BasicScraper. Not needed anymore; there is a faster threaded version.
+
+### Fixed
+
+- Now the User-Agent is correctly applied everywhere.
+
+## 0.2.2 - 2020-07-20
+
+### Added
+
+- Parser for 8kun.top.
+
+### Changed
+
+- The check for whether a site is supported now just looks for a substring;
+- Edited the regex that checks whether a filename is just "image.ext" so it
+  only allows 1 to 4 characters after "image.".
+
+### Notes
+
+- Consider the issue with size on 2ch.hk: usually it really does report the
+  size in kB, but sometimes it's just wrong.
+
+## 0.2.1 - 2020-07-18
+
+### Changed
+
+- Now the program tells you which thread doesn't exist or is about to be
+  scraped. That is useful in batch processing with scripts.
+
 ## 0.2.0 - 2020-07-18
 ### Added
 - Threaded version of the scraper, so now it is fast as heck!
2 Makefile

@@ -1,7 +1,7 @@
 build: scrapthechan README.md setup.cfg
 	python setup.py sdist bdist_wheel
 install:
-	python -m pip install --upgrade dist/scrapthechan-0.2.0-py3-none-any.whl --user
+	python -m pip install --upgrade dist/scrapthechan-0.5.1-py3-none-any.whl --user
 uninstall:
 	# We change directory so pip uninstall will run, it'll fail otherwise.
 	@cd ~/
31 README.md

@@ -1,8 +1,8 @@
 This is a tool for scraping files from imageboards' threads.
 
-It extracts the files from a JSON version of a thread. And then downloads 'em
-in a specified output directory or if it isn't specified then creates following
-directory hierarchy in a working directory:
+It extracts the files from a JSON representation of a thread. And then downloads
+'em in a specified output directory or if it isn't specified then creates
+following directory hierarchy in a working directory:
 
     <imageboard name>
     |-<board name>
@@ -24,9 +24,24 @@ separately. E.g. `4chan b 1100500`.
 
 `-o`, `--output-dir` -- output directory where all files will be dumped to.
 
-`--no-op` -- by default OP's post will be saved in a `!op.txt` file. This flag
-disables this behaviour. I desided to put an `!` in a name so this file will be
-on the top in a directory listing.
+`-N`, `--no-op` -- by default OP's post will be saved in a `!op.txt` file. This
+flag disables this behaviour. The exclamation mark `!` in the name is there so
+this file will be at the top of a directory listing.
 
-`-v`, `--version` prints the version of the program, and `-h`, `--help` prints
-help for a program.
+`-S <num>`, `--skip-posts <num>` -- skip a given number of posts.
+
+`-v`, `--version` prints the version of the program.
+
+`-h`, `--help` prints help for the program.
+
+# Supported imageboards
+
+- [4chan.org](https://4chan.org) since 0.1.0
+- [lainchan.org](https://lainchan.org) since 0.1.0
+- [2ch.hk](https://2ch.hk) since 0.1.0
+- [8kun.top](https://8kun.top) since 0.2.2
+
+# TODO
+
+- Sane rewrite of the program;
+- Thread watcher.
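For readers following the diff, here is a hedged sketch of the library-level equivalent of `scrapthechan -S 10 -o out <url>`, using only the signatures visible in this comparison (`get_parser_by_url`, `ThreadedScraper`). The thread URL is hypothetical, and constructing a parser performs a network fetch:

```python
# Hedged sketch, not documented API usage: what the CLI wires together.
from scrapthechan.parsers import get_parser_by_url
from scrapthechan.scrapers.threadedscraper import ThreadedScraper

# Hypothetical thread URL; the parser fetches the thread's JSON on construction.
parser = get_parser_by_url("https://boards.4chan.org/b/thread/100500",
    skip_posts=10)
scraper = ThreadedScraper("out", parser.files,
    lambda i: print(f"{i}/{len(parser.files)}", end="\r"))
scraper.run()
```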
@@ -1,13 +1,16 @@
-__date__ = "18 Jule 2020"
-__version__ = "0.2.0"
+__date__ = "4 May 2021"
+__version__ = "0.5.1"
 __author__ = "Alexander \"Arav\" Andreev"
 __email__ = "me@arav.top"
-__copyright__ = f"Copyright (c) 2020 {__author__} <{__email__}>"
+__copyright__ = f"Copyright (c) 2020,2021 {__author__} <{__email__}>"
 __license__ = \
 """This program is licensed under the terms of the MIT license.
 For a copy see COPYING file in a directory of the program, or
 see <https://opensource.org/licenses/MIT>"""
 
+
+USER_AGENT = f"ScrapTheChan/{__version__}"
+
 VERSION = \
-    f"ScrapTheChan ver. {__version__} ({__date__})\n\n{__copyright__}\n"\
+    f"ScrapTheChan ver. {__version__} ({__date__})\n{__copyright__}\n"\
     f"\n{__license__}"
@@ -3,21 +3,20 @@ from os import makedirs
 from os.path import join, exists
 from re import search
 from sys import argv
-from typing import List
+from typing import List, Optional
 
 from scrapthechan import VERSION
-from scrapthechan.parser import Parser, ParserThreadNotFoundError
+from scrapthechan.parser import Parser, ThreadNotFoundError
 from scrapthechan.parsers import get_parser_by_url, get_parser_by_site, \
     SUPPORTED_IMAGEBOARDS
-#from scrapthechan.scrapers.basicscraper import BasicScraper
 from scrapthechan.scrapers.threadedscraper import ThreadedScraper
 
 
 __all__ = ["main"]
 
 
-USAGE = \
-"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
+USAGE: str = \
+f"""Usage: scrapthechan [OPTIONS] (URL | IMAGEBOARD BOARD THREAD)
 
 Options:
 \t-h,--help -- print this help and exit;
@@ -27,6 +26,7 @@ Options:
 \t    <imageboard>/<board>/<thread>;
 \t-N,--no-op -- by default OP's post will be written in !op.txt file. This
 \t    option disables this behaviour;
+\t-S,--skip-posts <num> -- skip given number of posts.
 
 Arguments:
 \tURL -- URL of a thread;
@@ -34,18 +34,18 @@ Arguments:
 \tBOARD -- short name of a board. E.g. b;
 \tTHREAD -- ID of a thread. E.g. 100500.
 
-Supported imageboards: 4chan.org, 2ch.hk, lainchan.org.
+Supported imageboards: {', '.join(SUPPORTED_IMAGEBOARDS)}.
 """
 
 
-def parse_common_arguments(args: str) -> dict:
+def parse_common_arguments(args: str) -> Optional[dict]:
     r = r"(?P<help>-h|--help)|(?P<version>-v|--version)"
-    argd = search(r, args)
-    if not argd is None:
-        argd = argd.groupdict()
+    args = search(r, args)
+    if not args is None:
+        args = args.groupdict()
         return {
-            "help": not argd["help"] is None,
-            "version": not argd["version"] is None }
+            "help": not args["help"] is None,
+            "version": not args["version"] is None }
     return None
 
 def parse_arguments(args: str) -> dict:
@@ -54,15 +54,21 @@ def parse_arguments(args: str) -> dict:
     if not link is None:
         link = link.groupdict()
     out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", args)
+    skip_posts = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", args)
     return {
         "site": None if link is None else link["site"],
         "board": None if link is None else link["board"],
         "thread": None if link is None else link["thread"],
+        "skip-posts": None if skip_posts is None else int(skip_posts.group('skip')),
        "no-op": not search(r"-N|--no-op", args) is None,
        "output-dir": None if out_dir is None \
            else out_dir.groupdict()["outdir"] }
 
 def main() -> None:
+    if len(argv) == 1:
+        print(USAGE)
+        exit()
+
     cargs = parse_common_arguments(' '.join(argv[1:]))
     if not cargs is None:
         if cargs["help"]:
@@ -79,19 +85,22 @@ def main() -> None:
         exit()
 
     try:
-        parser = get_parser_by_site(args["site"], args["board"], args["thread"])
+        if not args["skip-posts"] is None:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"], args["skip-posts"])
+        else:
+            parser = get_parser_by_site(args["site"], args["board"],
+                args["thread"])
     except NotImplementedError as ex:
         print(f"{str(ex)}.")
         print(f"Supported image boards are {', '.join(SUPPORTED_IMAGEBOARDS)}")
         exit()
-    except ParserThreadNotFoundError:
-        print(f"Thread is no longer exist.")
+    except ThreadNotFoundError as e:
+        print(f"Thread {args['site']}/{args['board']}/{args['thread']} " \
+            f"not found. Reason: {e.reason}")
         exit()
 
-    flen = len(parser.files)
-
-    print(f"There are {flen} files in this thread.")
+    files_count = len(parser.files)
 
     if not args["output-dir"] is None:
         save_dir = args["output-dir"]
@@ -99,25 +108,26 @@ def main() -> None:
         save_dir = join(parser.imageboard, parser.board,
             parser.thread)
 
-    print(f"They will be saved in {save_dir}.")
+    print(f"{files_count} files in " \
+        f"{args['site']}/{args['board']}/{args['thread']}. " \
+        f"They're going to {save_dir}. ", end="")
 
     makedirs(save_dir, exist_ok=True)
 
     if not args["no-op"]:
-        print("Writing OP... ", end='')
         if parser.op is None:
-            print("No text's there.")
+            print("OP's empty.")
         elif not exists(join(save_dir, "!op.txt")):
-            with open(join(save_dir, "!op.txt"), 'w') as opf:
+            with open(join(save_dir, "!op.txt"), 'w', encoding='utf-8') as opf:
                 opf.write(f"{parser.op}\n")
-            print("Done.")
+            print("OP's written.")
         else:
-            print("Exists.")
+            print("OP exists.")
 
     scraper = ThreadedScraper(save_dir, parser.files, \
-        lambda i: print(f"{i}/{flen}", end="\r"))
+        lambda i: print(f"{i}/{files_count}", end="\r"))
     scraper.run()
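A note on the option parsing above: each regex wraps the option/value pair in a `(?=...)` lookahead, so `search` can find it anywhere in the joined argument string without consuming it. A self-contained sketch with a made-up argument string:

```python
# Sketch of the lookahead-based option extraction used in parse_arguments.
from re import search

argline = "4chan b 100500 -S 15 --output-dir out"  # made-up CLI input
out_dir = search(r"(?=(-o|--output-dir) (?P<outdir>\S+))", argline)
skip = search(r"(?=(-S|--skip-posts) (?P<skip>\d+))", argline)
print(out_dir.group("outdir"))  # out
print(int(skip.group("skip")))  # 15
```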
@@ -1,23 +1,23 @@
-"""FileInfo object stores all needed information about a file."""
+"""FileInfo object stores information about a file."""
 
+from dataclasses import dataclass
 
 __all__ = ["FileInfo"]
 
 
+@dataclass(frozen=True, order=True)
 class FileInfo:
-    """Stores all needed information about a file.
+    """Stores information about a file.
 
-    Arguments:
+    Fields:
     - `name` -- name of a file;
     - `size` -- size of a file;
-    - `dlurl` -- full download URL for a file;
+    - `download_url` -- full download URL for a file;
     - `hash_value` -- hash sum of a file;
-    - `hash_algo` -- hash algorithm used (e.g. md5).
+    - `hash_algorithm` -- hash algorithm used (e.g. md5).
     """
-    def __init__(self, name: str, size: int, dlurl: str,
-        hash_value: str, hash_algo: str) -> None:
-        self.name = name
-        self.size = size
-        self.dlurl = dlurl
-        self.hash_value = hash_value
-        self.hash_algo = hash_algo
+    name: str
+    size: int
+    download_url: str
+    hash_value: str
+    hash_algorithm: str
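For illustration, here is what the switch to `@dataclass(frozen=True, order=True)` buys (a sketch; all field values are made up):

```python
# Sketch of the new FileInfo semantics; every value below is made up.
from scrapthechan.fileinfo import FileInfo

a = FileInfo("cat.jpg", 4096, "https://example.invalid/cat.jpg", "abc1", "md5")
b = FileInfo("cat.jpg", 4096, "https://example.invalid/cat.jpg", "abc1", "md5")
print(a == b)                   # True: equality is generated field-by-field
# a.size = 0                    # would raise dataclasses.FrozenInstanceError
print(sorted([b, a])[0].name)   # order=True makes lists of files sortable
```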
@@ -4,16 +4,22 @@ from itertools import chain
 from json import loads
 from re import findall, match
 from typing import List, Optional
-from urllib.request import urlopen, urlretrieve
+from urllib.request import urlopen, Request, HTTPError
 
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
 
 
-__all__ = ["Parser", "ParserThreadNotFoundError"]
+__all__ = ["Parser", "ThreadNotFoundError"]
 
 
-class ParserThreadNotFoundError(Exception):
-    pass
+class ThreadNotFoundError(Exception):
+    def __init__(self, reason: str = ""):
+        self._reason = reason
+
+    @property
+    def reason(self) -> str:
+        return self._reason
 
 
 class Parser:
@@ -24,28 +30,42 @@ class Parser:
 
     Arguments:
     board -- is a name of a board on an image board;
-    thread -- is a name of a thread inside a board;
-    posts -- is a list of posts in form of dictionaries exported from a JSON;
+    thread -- is an id of a thread inside a board;
     skip_posts -- number of posts to skip.
 
     All the extracted files will be stored as the `FileInfo` objects."""
-    __url_thread_json: str = "https://example.org/{board}/{thread}.json"
-    __url_file_link: str = None
-
-    def __init__(self, board: str, thread: str, posts: List[dict],
+    def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        self._board = board
-        self._thread = thread
-        self._op_post = posts[0]
-        if not skip_posts is None:
-            posts = posts[skip_posts:]
+        self._board: str = board
+        self._thread: str = thread
+        self._posts = self._extract_posts_list(self._get_json())
+        self._op_post: dict = self._posts[0]
+        self._posts = self._posts[skip_posts:] if not skip_posts is None else self._posts
         self._files = list(chain.from_iterable(filter(None, \
-            map(self._parse_post, posts))))
+            map(self._parse_post, self._posts))))
+
+    @property
+    def json_thread_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def file_base_url(self) -> str:
+        raise NotImplementedError
+
+    @property
+    def subject_field(self) -> str:
+        return "sub"
+
+    @property
+    def comment_field(self) -> str:
+        return "com"
 
     @property
     def imageboard(self) -> str:
         """Returns image board's name."""
-        return NotImplementedError
+        raise NotImplementedError
 
     @property
     def board(self) -> str:
@@ -61,21 +81,40 @@
     def op(self) -> str:
         """Returns OP's post as combination of subject and comment separated
         by a new line."""
-        raise NotImplementedError
+        op = ""
+        if self.subject_field in self._op_post:
+            op = f"{self._op_post[self.subject_field]}\n"
+        if self.comment_field in self._op_post:
+            op += self._op_post[self.comment_field]
+        return op if not op == "" else None
 
     @property
     def files(self) -> List[FileInfo]:
         """Returns a list of retrieved files as `FileInfo` objects."""
         return self._files
 
-    def _get_json(self, thread_url: str) -> dict:
-        """Gets JSON version of a thread and converts it in a dictionary."""
-        try:
-            with urlopen(thread_url) as url:
-                return loads(url.read().decode('utf-8'))
-        except:
-            raise ParserThreadNotFoundError
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        """This method must be overridden in child classes where you specify
+        a path in a JSON document where posts are stored. E.g., on 4chan this is
+        ['posts'], and on 2ch.hk it's ['threads'][0]['posts']."""
+        return lst
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        """Parses a single post and extracts files into `FileInfo` object."""
+    def _get_json(self) -> dict:
+        """Retrieves a JSON representation of a thread and converts it in
+        a dictionary."""
+        try:
+            thread_url = self.json_thread_url.format(board=self._board, \
+                thread=self._thread)
+            req = Request(thread_url, headers={'User-Agent': USER_AGENT})
+            with urlopen(req) as url:
+                return loads(url.read().decode('utf-8'))
+        except HTTPError as e:
+            raise ThreadNotFoundError(str(e))
+        except Exception as e:
+            raise e
+
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        """Parses a single post and extracts files into `FileInfo` object.
+        Single object is wrapped in a list for convenient insertion into
+        a list."""
         raise NotImplementedError
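The rewritten `Parser` is now a template-method base class: a subclass supplies the URLs, field names, and JSON path, while the base class does the fetching and file extraction. A hypothetical minimal subclass, sketched from the signatures above (a real one would also need `_parse_post`, or would inherit it from `TinyboardLikeParser`, added later in this diff):

```python
# Hypothetical subclass sketch; the site name and URLs are made up.
from typing import List, Optional

from scrapthechan.parser import Parser


class ExampleChanParser(Parser):
    @property
    def imageboard(self) -> str:
        return "examplechan.invalid"

    @property
    def json_thread_url(self) -> str:
        return "https://examplechan.invalid/{board}/res/{thread}.json"

    @property
    def file_base_url(self) -> str:
        return "https://examplechan.invalid/{board}/src/{filename}"

    def _extract_posts_list(self, lst: List) -> List[dict]:
        # Point the base class at where posts live in this board's JSON.
        return lst["posts"]
```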
@@ -1,6 +1,6 @@
 """Here are defined the JSON parsers for imageboards."""
 from re import search
-from typing import List
+from typing import List, Optional
 
 from scrapthechan.parser import Parser
 
@@ -8,27 +8,31 @@ from scrapthechan.parser import Parser
 __all__ = ["SUPPORTED_IMAGEBOARDS", "get_parser_by_url", "get_parser_by_site"]
 
 
-SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk"]
+URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
+SUPPORTED_IMAGEBOARDS: List[str] = ["4chan.org", "lainchan.org", "2ch.hk", \
+    "8kun.top"]
 
 
-def get_parser_by_url(url: str) -> Parser:
+def get_parser_by_url(url: str, skip_posts: Optional[int] = None) -> Parser:
     """Parses URL and extracts from it site name, board and thread.
     And then returns initialised Parser object for detected imageboard."""
-    URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
     site, board, thread = search(URLRX, url).groups()
-    return get_parser_by_site(site, board, thread)
+    return get_parser_by_site(site, board, thread, skip_posts)
 
-def get_parser_by_site(site: str, board: str, thread: str) -> Parser:
+def get_parser_by_site(site: str, board: str, thread: str,
+    skip_posts: Optional[int] = None) -> Parser:
     """Returns an initialised parser for `site` with `board` and `thread`."""
-    if site in ['boards.4chan.org', 'boards.4channel.org',
-        '4chan', '4chan.org']:
+    if '4chan' in site:
         from .fourchan import FourChanParser
-        return FourChanParser(board, thread)
+        return FourChanParser(board, thread, skip_posts)
-    elif site in ['lainchan.org', 'lainchan']:
+    elif 'lainchan' in site:
         from .lainchan import LainchanParser
-        return LainchanParser(board, thread)
+        return LainchanParser(board, thread, skip_posts)
-    elif site in ['2ch.hk', '2ch']:
+    elif '2ch' in site:
         from .dvach import DvachParser
-        return DvachParser(board, thread)
+        return DvachParser(board, thread, skip_posts)
+    elif '8kun' in site:
+        from .eightkun import EightKunParser
+        return EightKunParser(board, thread, skip_posts)
     else:
         raise NotImplementedError(f"Parser for {site} is not implemented")
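The now module-level `URLRX` is what `get_parser_by_url` uses to split a thread URL into site, board, and thread. A small sketch (the URL is made up):

```python
# Sketch: URLRX pulls (site, board, thread) out of a thread URL.
from re import search

URLRX = r"https?:\/\/(?P<s>[\w\.]+)\/(?P<b>\w+)\/(?:\w+)?\/(?P<t>\w+)"
site, board, thread = search(URLRX,
    "https://boards.4chan.org/g/thread/100500").groups()
print(site, board, thread)  # boards.4chan.org g 100500
```

With the substring dispatch that follows, any host containing `4chan` (e.g. `boards.4chan.org`, `boards.4channel.org`) routes to `FourChanParser` without enumerating every hostname.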
@ -10,39 +10,54 @@ __all__ = ["DvachParser"]
|
||||||
class DvachParser(Parser):
|
class DvachParser(Parser):
|
||||||
"""JSON parser for 2ch.hk image board."""
|
"""JSON parser for 2ch.hk image board."""
|
||||||
|
|
||||||
__url_thread_json = "https://2ch.hk/{board}/res/{thread}.json"
|
|
||||||
__url_file_link = "https://2ch.hk"
|
|
||||||
|
|
||||||
def __init__(self, board: str, thread: str,
|
def __init__(self, board: str, thread: str,
|
||||||
skip_posts: Optional[int] = None) -> None:
|
skip_posts: Optional[int] = None) -> None:
|
||||||
posts = self._get_json(self.__url_thread_json.format(board=board, \
|
super().__init__(board, thread, skip_posts)
|
||||||
thread=thread))['threads'][0]['posts']
|
|
||||||
super(DvachParser, self).__init__(board, thread, posts, skip_posts)
|
@property
|
||||||
|
def json_thread_url(self) -> str:
|
||||||
|
return "https://2ch.hk/{board}/res/{thread}.json"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def file_base_url(self) -> str:
|
||||||
|
return "https://2ch.hk"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def subject_field(self) -> str:
|
||||||
|
return "subject"
|
||||||
|
|
||||||
|
@property
|
||||||
|
def comment_field(self) -> str:
|
||||||
|
return "comment"
|
||||||
|
|
||||||
@property
|
@property
|
||||||
def imageboard(self) -> str:
|
def imageboard(self) -> str:
|
||||||
return "2ch.hk"
|
return "2ch.hk"
|
||||||
|
|
||||||
@property
|
def _extract_posts_list(self, lst: List) -> List[dict]:
|
||||||
def op(self) -> Optional[str]:
|
return lst['threads'][0]['posts']
|
||||||
op = ""
|
|
||||||
if 'sub' in self._op_post:
|
|
||||||
op = f"{self._op_post['subject']}\n"
|
|
||||||
if 'com' in self._op_post:
|
|
||||||
op += self._op_post['comment']
|
|
||||||
return op if not op == "" else None
|
|
||||||
|
|
||||||
def _parse_post(self, post) -> Optional[List[FileInfo]]:
|
def _parse_post(self, post) -> Optional[List[FileInfo]]:
|
||||||
if not 'files' in post: return None
|
if not 'files' in post: return None
|
||||||
|
|
||||||
files = []
|
files = []
|
||||||
|
|
||||||
for f in post['files']:
|
for f in post['files']:
|
||||||
if match(f['fullname'], r"^image\.\w+$") is None:
|
if not 'sticker' in f:
|
||||||
|
if match(r"^image\.\w+$", f['fullname']) is None:
|
||||||
fullname = f['fullname']
|
fullname = f['fullname']
|
||||||
else:
|
else:
|
||||||
fullname = f['name']
|
fullname = f['name']
|
||||||
|
else:
|
||||||
|
fullname = f['name']
|
||||||
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
|
# Here's same thing as 4chan. 2ch.hk also has md5 field, so it is
|
||||||
# completely fine to hardcode `hash_algo`.
|
# completely fine to hardcode `hash_algo`.
|
||||||
|
if 'md5' in f:
|
||||||
files.append(FileInfo(fullname, f['size'],
|
files.append(FileInfo(fullname, f['size'],
|
||||||
f"{self.__url_file_link}{f['path']}",
|
f"{self.file_base_url}{f['path']}",
|
||||||
f['md5'], 'md5'))
|
f['md5'], 'md5'))
|
||||||
|
else:
|
||||||
|
files.append(FileInfo(fullname, f['size'],
|
||||||
|
f"{self.file_base_url}{f['path']}",
|
||||||
|
None, None))
|
||||||
return files
|
return files
|
||||||
|
|
|
@@ -0,0 +1,25 @@
+from typing import Optional
+
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
+
+__all__ = ["EightKunParser"]
+
+
+class EightKunParser(TinyboardLikeParser):
+    """JSON parser for 8kun.top image board."""
+
+    def __init__(self, board: str, thread: str,
+        skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+
+    @property
+    def imageboard(self) -> str:
+        return "8kun.top"
+
+    @property
+    def json_thread_url(self) -> str:
+        return "https://8kun.top/{board}/res/{thread}.json"
+
+    @property
+    def file_base_url(self) -> str:
+        return "https://media.8kun.top/file_dl/{filename}"
@@ -1,51 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.fileinfo import FileInfo
-from scrapthechan.parser import Parser
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["FourChanParser"]
 
 
-class FourChanParser(Parser):
+class FourChanParser(TinyboardLikeParser):
     """JSON parser for 4chan.org image board."""
 
-    __url_thread_json = "https://a.4cdn.org/{board}/thread/{thread}.json"
-    __url_file_link = "https://i.4cdn.org/{board}/{filename}"
-
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(FourChanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "4chan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
+    def json_thread_url(self) -> str:
+        return "https://a.4cdn.org/{board}/thread/{thread}.json"
 
-    def _parse_post(self, post: dict) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        # Hash algorithm is hardcoded since it is highly unlikely that it will
-        # be changed in foreseeable future. And if it'll change then this line
-        # will be necessarily updated anyway.
-        return [FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5')]
+    @property
+    def file_base_url(self) -> str:
+        return "https://i.4cdn.org/{board}/{filename}"
@@ -1,66 +1,25 @@
-from re import match
-from typing import List, Optional
+from typing import Optional
 
-from scrapthechan.parser import Parser
-from scrapthechan.fileinfo import FileInfo
+from scrapthechan.parsers.tinyboardlike import TinyboardLikeParser
 
 __all__ = ["LainchanParser"]
 
 
-class LainchanParser(Parser):
-    """JSON parser for lainchan.org image board.
-    JSON structure is identical to 4chan.org's, so this parser is just inherited
-    from 4chan.org's parser and only needed things are redefined.
-    """
-
-    __url_thread_json = "https://lainchan.org/{board}/res/{thread}.json"
-    __url_file_link = "https://lainchan.org/{board}/src/{filename}"
+class LainchanParser(TinyboardLikeParser):
+    """JSON parser for lainchan.org image board."""
 
     def __init__(self, board: str, thread: str,
         skip_posts: Optional[int] = None) -> None:
-        posts = self._get_json(self.__url_thread_json.format(board=board, \
-            thread=thread))['posts']
-        super(LainchanParser, self).__init__(board, thread, posts, skip_posts)
+        super().__init__(board, thread, skip_posts)
 
     @property
     def imageboard(self) -> str:
         return "lainchan.org"
 
     @property
-    def op(self) -> Optional[str]:
-        op = ""
-        if 'sub' in self._op_post:
-            op = f"{self._op_post['sub']}\n"
-        if 'com' in self._op_post:
-            op += self._op_post['com']
-        return op if not op == "" else None
-
-    def _parse_post(self, post) -> List[FileInfo]:
-        if not 'tim' in post: return None
-
-        dlfname = f"{post['tim']}{post['ext']}"
-
-        if "filename" in post:
-            if match(post['filename'], r"^image\.\w+$") is None:
-                filename = dlfname
-            else:
-                filename = f"{post['filename']}{post['ext']}"
-
-        files = []
-        files.append(FileInfo(filename, post['fsize'],
-            self.__url_file_link.format(board=self.board, filename=dlfname),
-            post['md5'], 'md5'))
-
-        if "extra_files" in post:
-            for f in post["extra_files"]:
-                dlfname = f"{f['tim']}{f['ext']}"
-                if "filename" in post:
-                    if match(post['filename'], r"^image\.\w+$") is None:
-                        filename = dlfname
-                    else:
-                        filename = f"{post['filename']}{post['ext']}"
-                dlurl = self.__url_file_link.format(board=self.board, \
-                    filename=dlfname)
-                files.append(FileInfo(filename, f['fsize'], \
-                    dlurl, f['md5'], 'md5'))
-        return files
+    def json_thread_url(self) -> str:
+        return "https://lainchan.org/{board}/res/{thread}.json"
+
+    @property
+    def file_base_url(self) -> str:
+        return "https://lainchan.org/{board}/src/{filename}"
@@ -0,0 +1,51 @@
+from re import match
+from typing import List, Optional
+
+from scrapthechan.parser import Parser
+from scrapthechan.fileinfo import FileInfo
+
+
+__all__ = ["TinyboardLikeParser"]
+
+
+class TinyboardLikeParser(Parser):
+    """Base parser for imageboards that are based on Tinyboard, or have similar
+    JSON API."""
+    def __init__(self, board: str, thread: str,
+        skip_posts: Optional[int] = None) -> None:
+        super().__init__(board, thread, skip_posts)
+
+    def _extract_posts_list(self, lst: List) -> List[dict]:
+        return lst['posts']
+
+    def _parse_post(self, post: dict) -> Optional[List[FileInfo]]:
+        if not 'tim' in post: return None
+
+        dlfname = f"{post['tim']}{post['ext']}"
+
+        if "filename" in post:
+            if match(r"^image\.\w+$", post['filename']) is None:
+                filename = dlfname
+            else:
+                filename = f"{post['filename']}{post['ext']}"
+
+        files = []
+
+        files.append(FileInfo(filename, post['fsize'],
+            self.file_base_url.format(board=self.board, filename=dlfname),
+            post['md5'], 'md5'))
+
+        if "extra_files" in post:
+            for f in post["extra_files"]:
+                dlfname = f"{f['tim']}{f['ext']}"
+                if "filename" in post:
+                    if match(r"^image\.\w+$", post['filename']) is None:
+                        filename = dlfname
+                    else:
+                        filename = f"{post['filename']}{post['ext']}"
+                dlurl = self.file_base_url.format(board=self.board, \
+                    filename=dlfname)
+                files.append(FileInfo(filename, f['fsize'], \
+                    dlurl, f['md5'], 'md5'))
+
+        return files
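Both the old 4chan and lainchan parsers called `match()` with the arguments swapped; `TinyboardLikeParser` fixes the order. A small demonstration with made-up filenames:

```python
# Sketch: why the swapped match() arguments were a bug.
from re import match

print(match(r"^image\.\w+$", "image.png"))   # Match object: generic name
print(match(r"^image\.\w+$", "sunset.jpg"))  # None: keep the real filename
# The old, swapped call match("sunset.jpg", r"^image\.\w+$") treated the
# filename as the pattern, so it never matched what was intended.
```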
@@ -1,21 +1,22 @@
-"""Base Scraper implementation."""
+"""Base class for all scrapers that will actually do the job."""
 
 from base64 import b64encode
 from os import remove, stat
 from os.path import exists, join, getsize
 import re
 from typing import List, Callable
-from urllib.request import urlretrieve, URLopener
+from urllib.request import urlretrieve, URLopener, HTTPError, URLError
 import hashlib
+from http.client import HTTPException
 
-from scrapthechan import __version__
+from scrapthechan import USER_AGENT
 from scrapthechan.fileinfo import FileInfo
 
 __all__ = ["Scraper"]
 
 
 class Scraper:
-    """Base scraper implementation.
+    """Base class for all scrapers that will actually do the job.
 
     Arguments:
     save_directory -- a path to a directory where file will be
@@ -29,7 +30,8 @@ class Scraper:
         self._save_directory = save_directory
         self._files = files
         self._url_opener = URLopener()
-        self._url_opener.version = f"ScrapTheChan/{__version__}"
+        self._url_opener.addheaders = [('User-Agent', USER_AGENT)]
+        self._url_opener.version = USER_AGENT
         self._progress_callback = download_progress_callback
 
     def run(self):
@@ -62,35 +64,83 @@ class Scraper:
             newname = f"{newname[:lbracket]}({int(num)+1})"
         return newname
 
-    def _hash_file(self, filename: str, hash_algo: str = "md5",
+    def _hash_file(self, filepath: str, hash_algorithm: str = "md5",
         blocksize: int = 1048576) -> (str, str):
         """Compute hash of a file."""
-        hash_func = hashlib.new(hash_algo)
-        with open(filename, 'rb') as f:
+        if hash_algorithm is None:
+            return None
+        hash_func = hashlib.new(hash_algorithm)
+        with open(filepath, 'rb') as f:
             buf = f.read(blocksize)
             while len(buf) > 0:
                 hash_func.update(buf)
                 buf = f.read(blocksize)
-        return hash_func.hexdigest(), hash_func.digest()
+        return hash_func.hexdigest(), b64encode(hash_func.digest()).decode()
 
-    def _is_file_ok(self, f: FileInfo, filepath: str) -> bool:
+    def _check_file(self, f: FileInfo, filepath: str) -> bool:
         """Check if a file exist and isn't broken."""
         if not exists(filepath):
             return False
         computed_size = getsize(filepath)
-        is_size_match = f.size == computed_size \
-            or f.size == round(computed_size / 1024)
-        hexdig, dig = self._hash_file(filepath, f.hash_algo)
-        is_hash_match = f.hash_value == hexdig \
-            or f.hash_value == b64encode(dig).decode()
-        return is_size_match and is_hash_match
+        if not (f.size == computed_size \
+            or f.size == round(computed_size / 1024)):
+            return False
+        if not f.hash_algorithm is None:
+            hexdig, dig = self._hash_file(filepath, f.hash_algorithm)
+            return f.hash_value == hexdig or f.hash_value == dig
+        return True
 
     def _download_file(self, f: FileInfo):
         """Download a single file."""
+        is_same_filename = False
         filepath = join(self._save_directory, f.name)
-        if self._is_file_ok(f, filepath):
-            return True
+        orig_filepath = filepath
+        if self._check_file(f, filepath):
+            return
         elif exists(filepath):
+            is_same_filename = True
             filepath = join(self._save_directory, \
                 self._same_filename(f.name, self._save_directory))
-        self._url_opener.retrieve(f.dlurl, filepath)
+        try:
+            retries = 3
+            while retries > 0:
+                self._url_opener.retrieve(f.download_url, filepath)
+                if not self._check_file(f, filepath):
+                    remove(filepath)
+                    retries -= 1
+                else:
+                    break
+            if retries == 0:
+                print(f"Cannot retrieve {f.download_url}, {filepath}.")
+                return
+            if is_same_filename:
+                _, f1_dig = self._hash_file(orig_filepath, f.hash_algorithm)
+                _, f2_dig = self._hash_file(filepath, f.hash_algorithm)
+                if f1_dig == f2_dig:
+                    remove(filepath)
+        except FileNotFoundError as e:
+            print("File Not Found", filepath)
+        except HTTPError as e:
+            print("HTTP Error", e.code, e.reason, f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except HTTPException:
+            print("HTTP Exception for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except URLError as e:
+            print("URL Error for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionResetError:
+            print("Connection reset for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionRefusedError:
+            print("Connection refused for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
+        except ConnectionAbortedError:
+            print("Connection aborted for", f.download_url)
+            if exists(filepath):
+                remove(filepath)
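One detail worth highlighting from `_hash_file`: it now returns the digest in both hex and base64 forms, since (as the comparison in `_check_file` suggests) imageboards report MD5 in either encoding. A self-contained sketch with made-up contents:

```python
# Sketch: the two digest forms _check_file compares hash_value against.
import hashlib
from base64 import b64encode

data = b"made-up file contents"
md5 = hashlib.new("md5")
md5.update(data)
print(md5.hexdigest())                   # hex form of the digest
print(b64encode(md5.digest()).decode())  # base64 form of the same digest
```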
@@ -1,15 +0,0 @@
-"""Implementation of basic sequential one-threaded scraper that downloads
-files one by one."""
-
-from scrapthechan.scraper import Scraper
-
-__all__ = ["BasicScraper"]
-
-
-class BasicScraper(Scraper):
-    def run(self):
-        """Download files one by one."""
-        for i, f in enumerate(self._files, start=1):
-            if not self._progress_callback is None:
-                self._progress_callback(i)
-            self._download_file(f)
@@ -7,13 +7,14 @@ from multiprocessing.pool import ThreadPool
 from scrapthechan.scraper import Scraper
 from scrapthechan.fileinfo import FileInfo
 
 
 __all__ = ["ThreadedScraper"]
 
 
 class ThreadedScraper(Scraper):
     def __init__(self, save_directory: str, files: List[FileInfo],
         download_progress_callback: Callable[[int], None] = None) -> None:
-        super(ThreadedScraper, self).__init__(save_directory, files,
-            download_progress_callback)
+        super().__init__(save_directory, files, download_progress_callback)
         self._files_downloaded = 0
         self._files_downloaded_mutex = Lock()
 
@@ -24,8 +25,8 @@ class ThreadedScraper(Scraper):
         pool.join()
 
     def _thread_run(self, f: FileInfo):
-        with self._files_downloaded_mutex:
-            self._files_downloaded += 1
-        if not self._progress_callback is None:
+        if not self._progress_callback is None:
+            with self._files_downloaded_mutex:
+                self._files_downloaded += 1
             self._progress_callback(self._files_downloaded)
         self._download_file(f)
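The concurrency pattern above (a `ThreadPool` fanning work out, with a `Lock`-guarded counter so the progress callback sees a consistent count) is small enough to show standalone. This sketch uses made-up items in place of `FileInfo` objects:

```python
# Self-contained sketch of the ThreadPool + Lock progress pattern.
from multiprocessing.pool import ThreadPool
from threading import Lock

items = list(range(20))   # stand-ins for FileInfo objects
done = 0
done_mutex = Lock()

def work(item):
    global done
    with done_mutex:      # serialize counter updates across threads
        done += 1
        print(f"{done}/{len(items)}", end="\r")
    # ... the actual download would happen here ...

with ThreadPool() as pool:
    pool.map(work, items)
print()
```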
22 setup.cfg

@@ -1,31 +1,33 @@
 [metadata]
 name = scrapthechan
 version = attr: scrapthechan.__version__
-description =
-    Scrap the files posted in a thread on an imageboard. Currently supports
-    4chan.org, lainchan.org and 2ch.hk.
+description = Scrap the files from the imageboards.
 long_description = file: README.md
 long_description_content_type = text/markdown
 author = Alexander "Arav" Andreev
 author_email = me@arav.top
-url = https://arav.top
+url = https://git.arav.top/Arav/ScrapTheChan
 keywords =
     scraper
     imageboard
-    4chan
-    2ch
-    lainchan
+    4chan.org
+    2ch.hk
+    lainchan.org
+    8kun.top
 license = MIT
 license_file = COPYING
 classifiers =
-    Development Status :: 2 - Pre-Alpha
+    Development Status :: 3 - Alpha
     Environment :: Console
     Intended Audience :: End Users/Desktop
-    License :: Other/Proprietary License
+    License :: OSI Approved :: MIT License
     Natural Language :: English
     Operating System :: OS Independent
     Programming Language :: Python :: 3.7
-    Programming Language :: Python :: 3.8
+    Topic :: Communications :: BBS
+    Topic :: Internet :: WWW/HTTP
+    Topic :: Internet :: WWW/HTTP :: Dynamic Content :: Message Boards
+    Topic :: Text Processing
     Topic :: Utilities
 
 [options]