Lanz Mining

What's it about?

Crawl talk show guests, descriptions, political party memberships and all other available information from the web presence of the public broadcasters ARD and ZDF. Crawlers and parsers support Markus Lanz, Maischberger, Maybrit Illner, Caren Miosga and Hart aber fair. Public media data should be publicly available (public money, public data).

Requires

Python 3.11
Postgres installed for development
(Optional) Node installed

Installation

Fork and clone the repo
Install the Python dependencies with your favourite package management tool
(Optionally) use pdm.

Crawl and extract data

Crawling commands

To get the data locally, run pdm run python -m src.lanz_mining.crawl -t <targetShow>. This project currently supports the following targetShows:

markuslanz, maybritillner, carenmiosga, maischberger, hartaberfair
If you're using it with a cronjob, use --lates-only-flag.
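
To crawl every supported show in one go, the command above can be wrapped in a small script. The following is only a sketch: the helper names build_crawl_command and crawl_all are hypothetical and not part of the project, and it assumes pdm is on the PATH and the script is started from the repository root.

```python
import subprocess

# Target shows as listed in this README.
TARGET_SHOWS = [
    "markuslanz",
    "maybritillner",
    "carenmiosga",
    "maischberger",
    "hartaberfair",
]


def build_crawl_command(target_show):
    """Return the documented crawl command for one show as an argument list."""
    if target_show not in TARGET_SHOWS:
        raise ValueError(f"unsupported targetShow: {target_show}")
    return ["pdm", "run", "python", "-m", "src.lanz_mining.crawl", "-t", target_show]


def crawl_all(dry_run=True):
    """Print (dry run) or execute the crawl command for every supported show."""
    commands = [build_crawl_command(show) for show in TARGET_SHOWS]
    for command in commands:
        if dry_run:
            print(" ".join(command))
        else:
            # check=True stops at the first crawl that exits with an error.
            subprocess.run(command, check=True)
    return commands
```

Calling crawl_all() only prints the commands; crawl_all(dry_run=False) actually runs them one after another.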

There's another option for ZDF targetShows. Visit the ZDF Mediathek, enter the name of your targetShow in the search field, tick the 'ganze Sendungen' checkbox and load as many results as possible. Then save the HTML page and run pdm run python -m src.lanz_mining.crawl -t <targetShow> --file <htmlFile>.

With --file, the crawler visits all URLs found in the file and saves the HTML file of each episode.

All of the combinations above write the HTML files they find to outputs/html.
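
To illustrate what "visiting all URLs found in the file" could look like, here is a minimal sketch that collects the links from a saved search page using only the standard library. It is not the project's crawler: LinkCollector and extract_links are hypothetical names, and the real implementation may filter links differently or use another parser.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class LinkCollector(HTMLParser):
    """Collect the href of every <a> tag, resolved against a base URL."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        if not href:
            return
        # Relative links in a saved page need the site's base URL.
        url = urljoin(self.base_url, href)
        if url not in self.links:  # keep first occurrence, drop duplicates
            self.links.append(url)


def extract_links(html, base_url):
    """Return all unique absolute links found in an HTML string."""
    collector = LinkCollector(base_url)
    collector.feed(html)
    return collector.links
```

The base URL has to be supplied by the caller, because a saved page usually contains relative links. Each returned URL could then be downloaded and stored as its own HTML file.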

Extract data

Once you have some HTML files ready, run pdm run python -m src.lanz_mining.scrape_local --input-dir outputs/html --output-file data-processed.csv.

Information extraction uses regexes that match certain indicators, e.g. for roles. Where information is missing, scrape_local tries to find it in other formats using the guest's name.
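
As a rough illustration of regex-based extraction, the sketch below splits a guest line into a name and a role and looks for a party abbreviation in the role. Everything in it is an assumption: the "Name, role" line format, both patterns, the party list and the function parse_guest are made up for this example and are not the patterns used by scrape_local.

```python
import re

# Assumed input format: "Name, role description" (hypothetical).
GUEST_PATTERN = re.compile(r"^(?P<name>[^,]+),\s*(?P<role>.+)$")
# Illustrative, incomplete list of party abbreviations.
PARTY_PATTERN = re.compile(r"\b(CDU|CSU|SPD|FDP|AfD|BSW|Grüne|Linke)\b")


def parse_guest(line):
    """Split a guest line into name, role and (if mentioned) party."""
    line = line.strip()
    match = GUEST_PATTERN.match(line)
    if not match:
        # No role indicator found: keep the name, leave the rest empty.
        return {"name": line, "role": None, "party": None}
    role = match.group("role").strip()
    party = PARTY_PATTERN.search(role)
    return {
        "name": match.group("name").strip(),
        "role": role,
        "party": party.group(1) if party else None,
    }
```

A line without a recognisable role comes back with role and party set to None, which is the kind of case where a lookup by guest name in other sources would be needed.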

Contributions

Currently I'm looking to reduce the remaining manual tasks, so that ideally everything runs automatically. To make this reliable, I'd be thankful for any hints on public APIs or other methods to map genres, identify party memberships and the like. I'd also be really happy to hear what you think: DM me on chaos.social/@arrrrrmin, or open an issue to further improve things.
