Urdu, Kashmiri and Maithili Support #10

anuragshas · 2019-04-12T06:54:34Z

I would like to contribute support for Urdu and Kashmiri, which are also official languages of India and have Indian origins.
I have started working on Urdu using the NLP for Marathi repo. I have gathered links to around 350K Wikipedia articles and am in the process of scraping them. I have also added multiprocessing support for gathering the articles.
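
For readers following along, here is a minimal sketch of gathering articles in parallel. It is not the code from the repo: `fetch_article` is a hypothetical placeholder for the real download-and-parse step, the URLs are made up, and it uses the thread-backed `multiprocessing.pool.ThreadPool` (same interface as `multiprocessing.Pool`) on the assumption that downloading is I/O-bound.

```python
from multiprocessing.pool import ThreadPool


def fetch_article(url):
    """Hypothetical stand-in for the real download-and-parse step.

    A real scraper would request the page here and extract its text;
    this placeholder returns a tagged string so the sketch runs offline.
    """
    return (url, f"text of {url}")


def scrape_all(urls, workers=4, chunksize=16):
    """Fetch every URL with a pool of workers and return a {url: text} dict."""
    with ThreadPool(processes=workers) as pool:
        # imap_unordered hands back each result as soon as a worker finishes it
        return dict(pool.imap_unordered(fetch_article, urls, chunksize=chunksize))


if __name__ == "__main__":
    # Made-up links, only to exercise the function
    links = [f"https://ur.wikipedia.org/wiki/Article_{i}" for i in range(100)]
    articles = scrape_all(links)
    print(len(articles))
```

If parsing turns out to be CPU-heavy, swapping `ThreadPool` for `multiprocessing.Pool` should be a one-line change, as long as `fetch_article` stays a top-level function.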

goru001 · 2019-04-12T16:47:37Z

Thanks for the initiative, shout out if you need any help!

anuragshas · 2019-04-12T18:51:20Z

I am running out of memory while creating a TextLMDataBunch with only 100K articles and a 32K vocabulary. How much memory is required to create the data for the language model?

goru001 · 2019-04-13T10:25:34Z

It's advisable not to go beyond a vocab size of 30k. Are you talking about GPU memory? I have a GTX 1080 Ti with 11 GB of memory, on which I trained all the models. You can train your models on Google Colab if you're having trouble with memory on your own GPU. If you were talking about RAM, I think 16 GB should be fine!
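
To illustrate what capping the vocabulary means, here is a small sketch that keeps only the most frequent tokens up to a limit. It is plain Python, not the fastai implementation; `build_vocab`, `min_freq` and the special tokens are names assumed for the example.

```python
from collections import Counter


def build_vocab(tokenized_docs, max_vocab=30_000, min_freq=2,
                specials=("<unk>", "<pad>")):
    """Return a vocabulary of at most max_vocab entries, most frequent first.

    Tokens seen fewer than min_freq times are dropped; at lookup time
    they would map to the "<unk>" entry.
    """
    counts = Counter(tok for doc in tokenized_docs for tok in doc)
    room = max_vocab - len(specials)  # slots left after the special tokens
    kept = [tok for tok, n in counts.most_common()
            if n >= min_freq and tok not in specials][:room]
    return list(specials) + kept


docs = [["a", "b", "a"], ["a", "c", "b"]]
print(build_vocab(docs))
```

A smaller vocabulary shrinks the embedding and output layers, which is where much of a language model's memory goes.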

anuragshas · 2019-04-15T10:18:35Z

Thank you for the information. The issue was that a single file had over 350K characters, which could not be tokenized, numericalized and loaded into main memory all at once, so I had to select fewer sentences, and then it worked.
I have also created an LM for Maithili with a perplexity of 50. I am now looking for news data for the classification task.
Alongside this, I am training on Urdu with 150K articles.
Kashmiri has only 350 articles, and I think that won't be enough to create a language model.
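
An alternative to dropping sentences is to split an oversized file into smaller documents before tokenizing. The sketch below is one way to do that, not what was done in this thread; `split_into_chunks` and the 10,000-character default are assumptions chosen for illustration.

```python
import re


def split_into_chunks(text, max_chars=10_000):
    """Split text into pieces of at most max_chars, breaking at sentence ends.

    A sentence ends at '.', '?', '!', the Devanagari danda '।' or the
    Urdu full stop '۔' when followed by whitespace.
    """
    sentences = re.split(r"(?<=[.?!।۔])\s+", text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # A single over-long sentence is hard-wrapped so no chunk exceeds the cap
        while len(sentence) > max_chars:
            if current:
                chunks.append(current)
                current = ""
            chunks.append(sentence[:max_chars])
            sentence = sentence[max_chars:]
        if not sentence:
            continue
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate
        else:
            chunks.append(current)
            current = sentence
    if current:
        chunks.append(current)
    return chunks


big_text = "یہ ایک جملہ ہے۔ " * 1000
pieces = split_into_chunks(big_text, max_chars=200)
print(len(pieces), max(len(p) for p in pieces))
```

Each chunk can then be treated as its own row or file, so no single document has to be tokenized and numericalized in one go.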

goru001 · 2019-04-15T13:11:24Z

Good to know that it worked.
Yes, 350 articles seems too few. See if you can get data from somewhere else: news articles, govt. websites, etc.

anuragshas · 2019-04-18T08:38:32Z

I have completed the work for Urdu, and here is the link

Resources for Kashmiri are very scarce and some of them are paid; there are also e-paper websites, but they only have images. I am searching for more resources, and if I can't find any I will have to drop it.

I am working on scraping Maithili news websites.

goru001 · 2019-04-23T18:55:03Z

@anuragshas Thanks for the contribution! Would you like to raise a PR to add your model to iNLTK? (I can help you with the process.)

anuragshas · 2019-04-23T19:22:26Z

You are welcome. I am really happy that I will be able to raise my first PR on GitHub.
After going through the code, I guess I will have to change the config.py file, but I am not sure whether I will also have to fine-tune the all_languages_identifying_model.

goru001 · 2019-04-24T06:09:31Z

@anuragshas Don't worry about all_languages_identifying_model. I will be fine-tuning it to add Tamil to iNLTK, and I will tune it for Urdu as well.

As far as the LM is concerned, we won't be able to add it to iNLTK in its current form. Can you follow these instructions, upload the saved model to Dropbox and share its link with me?

Shout out if you need any help!

goru001 · 2019-04-28T05:31:12Z

@anuragshas You've been working on an LM for Maithili as well, right? Can you share the Wikipedia dataset you prepared for it? Tuning the language-classifier model again for Maithili will take time, and I was about to train it for Tamil, Urdu and Telugu, so I thought it would be great if we could add Maithili as well!

anuragshas · 2019-04-28T05:58:15Z

@goru001 Here is the link to MaithiliWikiArticles.
I have been busy searching for a job; I will create the PR for the Urdu LM as soon as I am free.

goru001 · 2019-04-28T06:05:44Z

@anuragshas No issues! Good luck :).

anuragshas · 2019-05-16T12:04:49Z

@goru001 I have uploaded the Urdu model following the instructions mentioned here. Please let me know what changes I should make to create the PR.

goru001 · 2019-05-16T13:11:52Z

@anuragshas Once you've imported the Tokenizer, you need to load the pretrained model that you saved last time, and then export it. To be very clear, you don't need to retrain; just do learn.load('your_saved_model_after_training.pth')
and then
learn.export('export.pkl').

Once you have 'export.pkl' and 'tokenizer.model' (the result of sentencepiece's unsupervised training), upload them to Dropbox, and then:

1. Add your language and language code to the config file.
2. Add both links to the config file.
3. The model I pushed today already has language-identifying capabilities for Urdu, so there is nothing to do on that front.
4. Check that all the functions run fine for Urdu.
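
The config changes in the list above could look roughly like the sketch below. Everything here is hypothetical: the dictionary names, the `register_language` helper and the example.com links are placeholders, and the actual iNLTK config file may be organised quite differently.

```python
# Hypothetical structure; the actual iNLTK config file may be organised differently.
LANGUAGE_CODES = {"hindi": "hi"}
MODEL_LINKS = {}  # language code -> links to the two uploaded files


def register_language(name, code, lm_url, tokenizer_url):
    """Add a language code plus the links to its export.pkl and tokenizer.model."""
    if code in LANGUAGE_CODES.values():
        raise ValueError(f"language code {code!r} is already registered")
    if not lm_url or not tokenizer_url:
        raise ValueError("both the export.pkl and tokenizer.model links are required")
    LANGUAGE_CODES[name] = code
    MODEL_LINKS[code] = {"lm": lm_url, "tokenizer": tokenizer_url}


# Placeholder links: substitute the real Dropbox share links.
register_language(
    "urdu",
    "ur",
    lm_url="https://example.com/urdu/export.pkl",
    tokenizer_url="https://example.com/urdu/tokenizer.model",
)
print(LANGUAGE_CODES["urdu"], sorted(MODEL_LINKS["ur"]))
```

Keeping the code and both links in one call makes it harder to add a language with only one of the two files linked.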

anuragshas · 2019-05-16T19:37:31Z

The __pycache__ and .idea folders are already present in the repo; shouldn't they be removed?

goru001 · 2019-05-17T01:52:31Z

Yes, you're right. That was a mistake from when I first committed. I've removed those now! Thanks!

ankur220693 · 2019-08-13T06:37:39Z

> Thank you for the information. The issue was that single file was having over 350K character which was unable to tokenized and numericalized at once and loaded into main memory so I had to select fewer sentences and it worked.
> I have also created LM for Maithili language having perplexity of 50. In search of news for classification task.
> Side by side I am training on Urdu language with 150K articles.
> Kashmiri Language has only 350 articles and I think it won't be enough to create language model

I am working on Maithili.
Does iNLTK support Maithili now?

goru001 · 2019-08-15T10:12:39Z

@ankur220693 Not yet! Feel free to contribute and raise a PR. Let me know if you need any help along the way.

anuragshas · 2019-08-19T10:16:17Z

@ankur220693 I am actually short of data for Maithili. The model I created was overfitting, so I had to put it on hold. If you can help us gather Maithili text from a reliable source, please let us know.

Akshayanti · 2020-09-23T10:28:39Z

@anuragshas would you have any updates regarding language resources for Kashmiri?

anuragshas · 2020-09-23T11:22:28Z

For Kashmiri, there is not enough publicly available data to work with. Check OSCAR or the Wikipedia dump to see whether more data is available now. The last time I scraped it, there were only 350 articles.

Iflaq12 · 2023-04-05T16:48:02Z

Dear @anuragshas, I think the Kashmiri Wikipedia has grown in size now.

anuragshas changed the title ~~Urdu and Kashmiri Support~~ Urdu, Kashmiri and Maithili Support Apr 15, 2019

goru001 added the enhancement New feature or request label Apr 18, 2019
